Celebrating 25 years of DDD's Excellence and Social Impact.

Generative AI

Red Teaming for GenAI

Red Teaming for GenAI: How Adversarial Data Makes Models Safer

Kevin Sahotsky

A generative AI model does not reveal its failure modes in normal operation. Standard evaluation benchmarks measure what a model does when it receives well-formed, expected inputs. They say almost nothing about what happens when the inputs are adversarial, manipulative, or designed to bypass the model’s safety training. The only way to discover those failure modes before deployment is to deliberately look for them. That is what red teaming does, and it has become a non-negotiable step in the safety workflow for any GenAI system intended for production use.

In the context of large language models, red teaming means generating inputs specifically designed to elicit unsafe, harmful, or policy-violating outputs, documenting the failure modes that emerge, and using that evidence to improve the model through additional training data, safety fine-tuning, or system-level controls. The adversarial inputs produced during red teaming are themselves a form of training data when they feed back into the model’s safety tuning.

This blog examines how red teaming works as a data discipline for GenAI, with trust and safety solutions and model evaluation services as the two capabilities most directly implicated in operationalizing it at scale.

Key Takeaways

  • Red teaming produces the adversarial data that safety fine-tuning depends on. Without it, a model is only as safe as the scenarios its developers thought to include in standard training.
  • Effective red teaming requires human creativity and domain knowledge, not just automated prompt generation. Automated tools cover known attack patterns; human red teamers find the novel ones.
  • The outputs of red teaming, documented attack prompts, model responses, and failure classifications, become training data for safety tuning when curated and labeled correctly.
  • Red teaming is not a one-time exercise. Models change after fine-tuning, and new attack techniques emerge continuously. Programs that treat red teaming as a pre-launch checkpoint rather than an ongoing process will accumulate safety debt.

Build the adversarial data and safety annotation programs that make GenAI deployment safe rather than just optimistic.

What Red Teaming Actually Tests

The Failure Modes Standard Evaluation Misses

Standard model evaluation measures performance on defined tasks: accuracy, fluency, factual correctness, and instruction-following. What it does not measure is robustness under adversarial pressure. The overview by Purpura et al. characterizes red teaming as proactively attacking LLMs with the purpose of identifying vulnerabilities, distinguishing it from standard evaluation precisely because its goal is to find what the model does wrong rather than to confirm what it does right. Failure modes that only appear under adversarial conditions include jailbreaks, where a model is induced to produce content its safety training should prevent; prompt injection, where malicious instructions embedded in user input override system-level controls; data extraction, where the model is induced to reproduce sensitive training data; and persistent harmful behavior that reappears after safety fine-tuning.

These failure modes matter operationally because they are the ones real-world adversaries will target. A model that performs well on standard benchmarks but succumbs to straightforward jailbreak techniques is not actually safe for deployment. The gap between benchmark performance and adversarial robustness is precisely the space that red teaming is designed to measure and close.

Categories of Adversarial Input

Red teaming for GenAI produces inputs across several categories of attack. Direct prompt injections attempt to override the model’s system instructions through user input. Jailbreaks use persona framing, fictional scenarios, or emotional manipulation to induce the model to bypass its safety training. Multi-turn attacks build context across a conversation to gradually shift model behavior in a harmful direction. Data extraction probes attempt to get the model to reproduce memorized training content. Indirect injections embed adversarial instructions within documents or retrieved content that the model processes. 

How Red Teaming Produces Training Data

From Attack to Dataset

The outputs of a red teaming exercise have two uses. First, they reveal where the model currently fails, informing decisions about deployment readiness, system-level controls, and the scope of additional training. Second, when curated and annotated correctly, they become the adversarial training examples that safety fine-tuning requires. A model cannot learn to refuse a jailbreak it has never been trained to recognize. The red teaming process generates the specific failure examples that the safety training data needs to include.

The curation step is critical and is where red teaming intersects directly with data quality. Raw red teaming outputs, attack prompts, and model responses need to be reviewed, classified by failure type, and annotated to indicate the correct model behavior. An attack prompt that produced a harmful response needs to be paired with a refusal response that correctly handles it. That pair becomes a safety training example. The quality of the annotation determines whether the safety training actually teaches the model what to do differently, or simply adds noise. Building generative AI datasets with human-in-the-loop workflows covers how iterative human review is structured to convert raw adversarial outputs into training-ready examples.

The Role of Diversity in Adversarial Datasets

The most common failure in red teaming programs is insufficient diversity in the adversarial examples generated. If all the jailbreak attempts follow similar patterns, the safety training data will be dense around those patterns but sparse around the full space of adversarial inputs the model will encounter in production. A model trained on a narrow set of attack patterns learns to refuse those specific patterns rather than learning generalized robustness to adversarial pressure. Effective red teaming programs deliberately vary the attack vector, framing, cultural context, language, and level of directness across their adversarial examples to produce safety training data with genuine coverage.

Human Red Teamers vs. Automated Approaches

What Automated Tools Can and Cannot Do

Automated red teaming tools generate adversarial inputs at scale by using one model to attack another, applying known jailbreak templates, fuzzing prompts systematically, or combining attack techniques programmatically. These tools are valuable for covering large input spaces rapidly and for regression testing after safety updates to check that previously patched vulnerabilities have not reappeared. The Microsoft AI red team’s review of over 100 GenAI products notes that specialist areas, including medicine, cybersecurity, and chemical, biological, radiological, and nuclear risks, require subject-matter experts rather than automated tools because harm evaluation itself requires domain knowledge that LLM-based evaluators cannot reliably provide.

Automated tools are also limited to the attack patterns they have been programmed or trained to generate. The most novel and damaging attack techniques tend to be discovered by human red teamers who approach the model with genuine adversarial creativity rather than systematic application of known patterns. A program that relies entirely on automation will develop good defenses against known attacks while remaining vulnerable to the class of novel attacks that automated systems did not anticipate.

Building an Effective Red Team

Effective red teaming programs combine specialists with different skill profiles: people with security backgrounds who understand attack methodology, domain experts who can evaluate whether a model response is genuinely harmful in the relevant context, people with diverse cultural and linguistic backgrounds who can identify failures that appear in non-English or non-Western cultural contexts, and generalists who approach the model as a motivated but non-expert adversary would. The diversity of the red team determines the coverage of the adversarial dataset. A red team drawn from a narrow demographic or professional background will produce adversarial examples that reflect their particular perspective on what constitutes a harmful input, which is systematically narrower than the full range of inputs the deployed model will encounter.

Red Teaming and the Safety Fine-Tuning Loop

How Adversarial Data Feeds Back Into Training

The standard safety fine-tuning workflow treats red teaming outputs as one of the key data inputs. Adversarial examples that elicit harmful model responses are paired with human-written refusal responses and added to the safety training dataset. The model is then retrained or fine-tuned on this expanded dataset, and the red teaming exercise is repeated to verify that the patched failure modes have been addressed and that the patches have not introduced new failures. This iterative loop between adversarial discovery and safety training is sometimes called purple teaming, reflecting the combination of the offensive red team and the defensive blue team. Human preference optimization integrates directly with this loop: the preference data collected during RLHF includes human judgments of adversarial responses, which trains the model to prefer refusal over compliance in the scenarios the red team identified.

Safety Regression After Fine-Tuning

One of the most significant challenges in the red teaming loop is safety regression: fine-tuning a model on new domain data or for new capabilities can reduce its robustness to adversarial inputs that it previously handled correctly. A model safety-tuned at one stage of development may lose some of that robustness after subsequent fine-tuning for domain specialization. This means red teaming is needed not just before initial deployment but after every significant fine-tuning operation. Programs that run red teaming once and then repeatedly fine-tune the model without re-testing are building up safety debt that will only become visible after deployment.

How Digital Divide Data Can Help

Digital Divide Data provides adversarial data curation, safety annotation, and model evaluation services that support red teaming programs across the full loop from adversarial discovery to safety fine-tuning.

For programs building adversarial training datasets, trust and safety solutions cover the annotation of red teaming outputs: classifying failure types, pairing attack prompts with correct refusal responses, and quality-controlling the adversarial examples that feed into safety fine-tuning. Annotation guidelines are designed to produce consistent refusal labeling across adversarial categories rather than case-by-case human judgments that vary across annotators.

For programs evaluating model robustness before and after safety updates, model evaluation services design adversarial evaluation suites that cover the range of attack categories the deployed model will face, stratified by attack type, cultural context, and domain. Regression testing frameworks verify that safety fine-tuning has addressed identified failure modes without degrading model performance on legitimate use cases.

For programs building the preference data that RLHF safety tuning requires, human preference optimization services provide structured comparison annotation where human evaluators judge model responses to adversarial inputs, producing the preference signal that trains the model to prefer safe behavior under adversarial pressure. Data collection and curation services build the diverse adversarial input sets that coverage-focused red teaming programs need.

Build adversarial data and safety-annotation programs that make your GenAI deployment truly safe. Talk to an expert.

Conclusion

Red teaming closes the gap between what a model does under normal conditions and what it does when someone is actively trying to make it fail. The adversarial data produced through red teaming is not incidental to safety fine-tuning. It is the input that safety fine-tuning most depends on. A model trained without adversarial examples will have unknown safety properties under adversarial pressure. A model trained on a well-curated, diverse adversarial dataset will have measurable robustness to the failure modes that dataset covers. The quality of the red teaming program determines the quality of that coverage.

Programs that treat red teaming as an ongoing discipline rather than a pre-launch checkbox build cumulative safety knowledge. Each red teaming cycle produces better adversarial data than the last because the team learns which attack patterns the model is most vulnerable to and can design the next cycle to probe those areas more deeply. The compounding effect of iterative red teaming, safety fine-tuning, and re-evaluation is a model whose adversarial robustness improves continuously rather than degrading as capabilities grow. Building trustworthy agentic AI with human oversight examines how this discipline extends to agentic systems where the safety stakes are higher, and the adversarial surface is larger.

Author:

Kevin Sahotsky, Head of Go-to-Market and Strategic Partnerships, Digital Divide Data

Kevin Sahotsky leads strategic partnerships and go-to-market strategy at Digital Divide Data, with deep experience in AI data services and annotation for physical AI, autonomy programs, and Generative AI use cases. He works with enterprise teams to navigate the operational complexity of production AI, helping them connect the right data strategy to real-world model performance. At DDD, Kevin focuses on bridging what organizations need from their AI data operations with the delivery capability, domain expertise, and quality infrastructure to make it happen.

References

Purpura, A., Wadhwa, S., Zymet, J., Gupta, A., Luo, A., Rad, M. K., Shinde, S., & Sorower, M. S. (2025). Building safe GenAI applications: An end-to-end overview of red teaming for large language models. In Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025) (pp. 335-350). Association for Computational Linguistics. https://aclanthology.org/2025.trustnlp-main.23/

Microsoft Security. (2025, January 13). 3 takeaways from red teaming 100 generative AI products. Microsoft Security Blog. https://www.microsoft.com/en-us/security/blog/2025/01/13/3-takeaways-from-red-teaming-100-generative-ai-products/

OWASP Foundation. (2025). OWASP top 10 for LLM applications. OWASP GenAI Security Project. https://genai.owasp.org/

European Parliament and Council of the European Union. (2024). Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union. https://artificialintelligenceact.eu/

Frequently Asked Questions

Q1. What is the difference between red teaming and standard model evaluation?

Standard evaluation measures model performance on expected, well-formed inputs. Red teaming specifically generates adversarial, manipulative, or policy-violating inputs to find failure modes that only appear under adversarial conditions. The goal of red teaming is to find what goes wrong, not to confirm what goes right.

Q2. How do red teaming outputs become training data?

Attack prompts that produce harmful model responses are paired with human-written correct refusal responses and added to the safety fine-tuning dataset. The model is then retrained on this expanded dataset, and the red teaming exercise is repeated to verify that patched failure modes have been addressed without introducing new ones.

Q3. Can automated red teaming tools replace human red teamers?

Automated tools are valuable for scale and regression testing, but are limited to known attack patterns. Human red teamers find novel attack methods that automated systems did not anticipate. Effective programs combine automated coverage of known attack patterns with human creativity for novel discovery. Domain-specific harms in medicine, cybersecurity, and other specialist areas require human expert evaluation that automated tools cannot reliably provide.

Q4. How often should red teaming be conducted?

Red teaming should be conducted before initial deployment and after every significant fine-tuning operation, because domain fine-tuning can reduce safety robustness. Programs that treat red teaming as a one-time pre-launch activity accumulate safety debt as the model is updated over its deployment lifetime.

Red Teaming for GenAI: How Adversarial Data Makes Models Safer Read Post »

Fine-Tuning

Instruction Tuning vs. Fine-Tuning: What the Data Difference Means for Your Model

When organisations begin building on top of large language models, two terms surface repeatedly: fine-tuning and instruction tuning. They are often used interchangeably, and that confusion is costly. The two approaches have different goals, require fundamentally different kinds of training data, and produce different types of model behaviour. Choosing the wrong one does not just slow a program down. It produces a model that fails to do what the team intended, and the root cause is almost always a misunderstanding of what data each method actually needs.

The distinction matters more now because the default starting point for most production programs has shifted. Teams are no longer building on raw base models. They are starting from instruction-tuned models and then deciding what to do next. That single decision shapes everything downstream: the format of the training data, the volume required, the annotation approach, and ultimately what the finished model can and cannot do reliably in production.

This blog examines instruction tuning and fine-tuning as distinct data problems, covering what each requires and how to decide which one your program needs. Human preference optimization and data collection and curation services are the two capabilities that determine whether either approach delivers reliable production performance.

Key Takeaways

  • Instruction tuning and domain fine-tuning are different interventions with different data requirements. Conflating them produces training programs that generate the wrong kind of model improvement.
  • Instruction tuning teaches a model how to respond to prompts. The data is a collection of diverse instruction-output pairs spanning many task types, and quality matters more than domain specificity.
  • Domain fine-tuning teaches a model what to know. The data is specialist content from a specific field, and coverage of that domain’s vocabulary, reasoning patterns, and conventions determines the performance ceiling.
  • Most production programs need both, applied in sequence: instruction tuning first to establish reliable behaviour, then domain fine-tuning to add specialist knowledge, then preference alignment to match actual user needs.
  • The most common data mistake is applying domain fine-tuning to a model that was never properly instruction-tuned, producing a model that knows more but follows instructions less reliably than before.

Common Data Mistakes and What They Produce

Using Domain Content as Instruction Data

One of the most frequent data design errors is building an instruction-tuning dataset from domain content rather than from task-diverse instruction-response pairs. A legal team, for example, assembles thousands of legal documents and treats them as fine-tuning data, hoping to produce a model that is both legally knowledgeable and instruction-following. The domain content teaches the model legal vocabulary and reasoning patterns. It does not teach the model how to respond to user instructions in a helpful, appropriately formatted way. The result is a model that sounds authoritative but does not reliably do what users ask.

Using Generic Instruction Data for Domain Fine-Tuning

The reverse mistake is using a publicly available general-purpose instruction dataset to attempt domain fine-tuning. Generic instruction data does not contain the specialist vocabulary, domain reasoning patterns, or domain-specific quality standards that make a model genuinely useful in a specialist field. A model fine-tuned on generic instruction examples will become slightly better at following generic instructions and no better at the target domain. 

The training data and the training goal must be aligned: domain fine-tuning requires domain data, and instruction tuning requires instruction-structured data. Text annotation services that structure domain content into an instruction-response format bridge the two requirements when a program needs both domain knowledge and instruction-following capability from the same dataset.

Neglecting Edge Cases and Refusals

Both instruction-tuning and fine-tuning programs commonly under-represent the edge cases that determine production reliability. Edge cases in instruction tuning are the ambiguous or potentially harmful instructions that the model will encounter in deployment. 

Edge cases in domain fine-tuning are the unusual domain scenarios that standard content collections underrepresent. In both cases, the model’s behaviour on the tail of the input distribution is determined by whether that tail was represented in training. Programs that evaluate only on the centre of the training distribution will consistently encounter production failures on inputs that were predictable edge cases.

What Each Method Is Actually Doing

Fine-Tuning: Adjusting What the Model Knows

Fine-tuning in its standard form takes a pre-trained model and continues training it on a new dataset. The goal is to shift the model’s internal knowledge and output distribution toward a target domain or task. As IBM’s documentation on instruction tuning explains, a pre-trained model does not answer prompts in the way a user expects. It appends text to them based on statistical patterns in its training data. Fine-tuning shapes what text gets appended and in what style, tone, and domain. The data requirement follows directly from this goal: fine-tuning data needs to represent the target domain comprehensively, which means coverage and authenticity matter more than the format of the training examples.

Full fine-tuning updates all model parameters, which gives the highest possible domain adaptation but requires significant compute and a large, high-quality dataset. Parameter-efficient approaches, including LoRA and QLoRA, update only a fraction of the model’s weights, making fine-tuning accessible on more constrained infrastructure while accepting some trade-off in maximum performance. The data requirements are similar regardless of the parameter efficiency method: the right domain content is still required, even if less compute is needed to train on it.

Instruction Tuning: Teaching the Model How to Respond

Instruction tuning is a specific form of fine-tuning where the training data consists of instruction-output pairs. The goal is not domain knowledge but behavioural alignment: teaching the model to follow instructions reliably, format outputs appropriately, and behave like a helpful assistant rather than a next-token predictor. The structured review characterises instruction tuning as training that improves a model’s generalisation to novel instructions it was not specifically trained on. The benefit is not task-specific but extends to the model’s overall instruction-following capability across any input it receives.

The data requirement for instruction tuning is therefore diversity rather than depth. A good instruction-tuning dataset spans many task types: summarisation, question answering, translation, classification, code generation, creative writing, and refusal of harmful requests. The examples teach the model a general pattern rather than specialist knowledge about any particular field. Breadth of task coverage matters more than the size of any single task category.

The Data Difference in Practice

What Fine-Tuning Data Looks Like

Domain fine-tuning data is the actual content of the target domain: clinical notes, legal contracts, financial research reports, engineering documentation, or customer service transcripts. The format can be relatively simple because the goal is to expose the model to the vocabulary, reasoning patterns, and conventions of the specialist field. What disqualifies data from being useful for fine-tuning is not format but relevance. Data that does not represent the target domain adds noise rather than signal, and data that represents the domain inconsistently teaches the model inconsistent patterns.

The quality threshold for fine-tuning data is specific. Factual accuracy is critical because a model fine-tuned on incorrect domain content will confidently produce incorrect domain outputs. Completeness of coverage matters because a legal model fine-tuned only on contract law will be unreliable on litigation or regulatory matters. Representativeness matters because if the fine-tuning data does not reflect the distribution of inputs the deployed model will receive, the model will perform well in training and poorly in production. AI data preparation services that assess coverage gaps and distribution alignment before fine-tuning begins prevent the most common version of this failure.

What Instruction-Tuning Data Looks Like

Instruction-tuning data is structured as instruction-response pairs, typically in a prompt-completion format where the instruction specifies what the model should do and the response demonstrates the correct behaviour. Quality requirements differ from domain fine-tuning in important ways. Factual correctness matters, but so does the quality of the instruction itself. 

A poorly written or ambiguous instruction teaches the model nothing useful about what good instruction-following looks like. Consistency in response format, tone, and the handling of edge cases matters because the model learns from the pattern across examples. Building generative AI datasets with human-in-the-loop workflows covers how instruction data is curated to ensure that examples collectively teach the right behavioural patterns rather than the individual habits of particular annotators.

The most consequential quality decision in instruction-tuning data concerns difficult cases: harmful instructions, ambiguous requests, and instructions that require refusing rather than complying. How refusal is modelled in the training data directly shapes the model’s refusal behaviour in production. Instruction-tuning programs that do not include carefully designed refusal examples produce models that either refuse too aggressively or not enough. Correcting this after training requires additional data and additional training cycles.

Why Most Programs Need Both, in the Right Order

The Sequence That Works

The most reliable architecture for production LLM programs combines instruction tuning and domain fine-tuning in sequence, not as alternatives. A base pre-trained model first undergoes instruction tuning to become a reliable instruction-following assistant. That instruction-tuned model then undergoes domain fine-tuning to acquire specialist knowledge. The order matters. Instruction tuning first establishes the foundational behaviour that domain fine-tuning should preserve rather than disrupt. 

Starting with domain fine-tuning on a raw base model often produces a model that knows more about the target domain but has lost the ability to follow instructions reliably, a failure mode known as catastrophic forgetting. Fine-tuning techniques for domain-specific language models examine how the sequence and data design at each stage determine whether domain specialisation is additive or disruptive to baseline model capability.

Where Preference Alignment Fits In

After instruction tuning and domain fine-tuning, the model knows how to respond and what to know. It does not yet know what users actually prefer among the responses it could produce. Reinforcement learning from human feedback closes this gap by training the model on human judgments of response quality. 

The preference data has its own specific requirements: it consists of comparison pairs rather than individual examples, it requires annotators who can make reliable quality judgments in the target domain, and the diversity of comparison pairs shapes the breadth of the model’s alignment. Human preference optimization at the quality level that production alignment requires is a distinct annotation discipline from both instruction data curation and domain content preparation.

Evaluating Whether the Data Worked

Evaluation Criteria Differ for Each Method

The evaluation framework for instruction tuning should measure instruction-following reliability across diverse task types: does the model produce the right output format, does it handle refusal cases correctly, does it remain consistent across paraphrased versions of the same instruction? Domain fine-tuning evaluation should measure domain accuracy, appropriate use of domain vocabulary, and correctness on the specific reasoning tasks the domain requires. Applying the wrong evaluation framework produces misleading results and misdirects subsequent data investment. Model evaluation services that design evaluation frameworks aligned to the specific goals of each training stage give programs the evidence they need to make reliable decisions about when a model is ready and where the next data investment should go.

When the Model Needs More Data vs. Different Data

The most common post-training question is whether poor performance indicates a volume problem or a data quality and coverage problem. More data of the same kind rarely fixes a coverage gap. It amplifies whatever patterns are already in the training set, including the gaps. A model that performs poorly on refusal cases needs more refusal examples, not more examples of the task types it already handles well. 

A domain fine-tuned model that misses rare but important domain scenarios needs examples of those scenarios, not additional examples of the common scenarios it already handles. Distinguishing volume problems from coverage problems requires error analysis on evaluation failures, not just aggregate metric tracking.

How Digital Divide Data Can Help

Digital Divide Data provides data collection, curation, and annotation services across the full LLM training stack, from instruction-tuning dataset design through domain fine-tuning content preparation and preference data collection for RLHF.

For instruction-tuning programs, data collection and curation services build task-diverse instruction-response datasets with explicit coverage of refusal cases, edge case instructions, and format diversity. Annotation guidelines are designed so that response quality is consistent across annotators, not just individually correct, because the model learns from the pattern across examples rather than from any single labeled instance.

For domain fine-tuning, text annotation services and AI data preparation services structure domain content into training-ready formats, audit coverage against the target deployment distribution, and identify the domain scenarios that standard content collections under-represent. Domain coverage analysis is conducted before training begins, not after the first evaluation reveals gaps.

For programs at the alignment stage, human preference optimization services provide structured comparison annotation with domain-calibrated annotators. Model evaluation services design evaluation frameworks that measure the right outcomes for each training stage, giving programs the signal they need to iterate effectively rather than optimising against the wrong metric.

Build LLM training programs on data designed for what each stage actually requires. Talk to an expert!

Conclusion

The data difference between instruction tuning and fine-tuning is not a technical detail. It is the primary design decision in any LLM customisation program. Instruction tuning teaches the model how to behave and needs diverse, well-structured task examples. Domain fine-tuning teaches the model what to know and needs accurate, representative domain content. Applying the data strategy designed for one to achieve the goal of the other produces a model that satisfies neither goal. Understanding the distinction before data collection begins saves programs from the most expensive form of rework in applied AI: retraining on data that was the wrong kind from the start.

Production programs that get this right treat each stage of the training stack as a distinct data engineering problem with its own quality requirements, coverage standards, and evaluation criteria. The programs that converge on reliable, production-grade models fastest are not those with the most data or the most compute. They are those with the clearest understanding of what their data needs to teach at each stage. Generative AI solutions built on data designed for each stage of the training stack are the programs that reach production reliably and perform there consistently.

References

Pratap, S., Aranha, A. R., Kumar, D., Malhotra, G., Iyer, A. P. N., & Shylaja, S. S. (2025). The fine art of fine-tuning: A structured review of advanced LLM fine-tuning techniques. Natural Language Processing Journal, 11, 100144. https://doi.org/10.1016/j.nlp.2025.100144

IBM. (2025). What is instruction tuning? IBM Think. https://www.ibm.com/think/topics/instruction-tuning

Savage, T., Ma, S. P., Boukil, A., Rangan, E., Patel, V., Lopez, I., & Chen, J. (2025). Fine-tuning methods for large language models in clinical medicine by supervised fine-tuning and direct preference optimization: Comparative evaluation. Journal of Medical Internet Research, 27, e76048. https://doi.org/10.2196/76048

Frequently Asked Questions

Q1. Is instruction tuning a type of fine-tuning?

Yes. Instruction tuning is a specific form of supervised fine-tuning where the training data consists of instruction-response pairs designed to improve the model’s general ability to follow user directives, rather than to add domain-specific knowledge. The distinction is in the goal and therefore in the data, not in the training mechanism.

Q2. How much data does instruction tuning require compared to domain fine-tuning?

Instruction tuning benefits more from the diversity of task coverage than from raw volume, and effective results have been demonstrated with carefully curated datasets of thousands to tens of thousands of examples. Domain fine-tuning volume requirements depend on how much specialist knowledge the model needs to acquire and on how well the domain is represented in the base model’s pretraining data.

Q3. What happens if you fine-tune a base model on domain data before instruction tuning?

Domain fine-tuning may improve the model’s domain knowledge but can disrupt its instruction-following capability, a failure mode known as catastrophic forgetting. The recommended sequence is to first tune instruction to establish reliable behavioural foundations, then fine-tune the domain to add specialist knowledge on top of that foundation.

Q4. Can you use the same dataset for both instruction tuning and domain fine-tuning?
A single dataset can serve both goals if it is structured as instruction-response pairs drawn from domain-specific content, combining task-diverse instructions with domain-accurate responses. This approach is more demanding to produce than either pure dataset type, but is efficient when both goals need to be addressed simultaneously. A practical example: a legal AI program might build a dataset where each entry pairs an instruction, such as summarise the key obligations in this contract clause, with a response written by a qualified legal reviewer. The instruction structure teaches the model to follow directives reliably. The domain-accurate legal response teaches it the vocabulary, reasoning, and precision required by the task. The same example serves both training goals, but only if the instructions are genuinely diverse across task types and the responses are reviewed for domain accuracy rather than generated at scale without expert validation.

Instruction Tuning vs. Fine-Tuning: What the Data Difference Means for Your Model Read Post »

computer vision retail

Retail Computer Vision: What the Models Actually Need to See

What is consistently underestimated in retail computer vision programs is the annotation burden those applications create. A shelf monitoring system trained on images captured under one store’s lighting conditions will fail in stores with different lighting. A product recognition model trained on clean studio images of product packaging will underperform on the cluttered, partially occluded, angled views that real shelves produce. 

A loss prevention system trained on footage from a low-footfall period will not reliably detect the behavioral patterns that appear in high-footfall conditions. In every case, the gap between a working demonstration and a reliable production deployment is a training data gap.

This blog examines what retail computer vision models actually need from their training data, from the annotation types each application requires to the specific challenges of product variability, environmental conditions, and continuous catalogue change that make retail annotation programs more demanding than most. Image annotation services and video annotation services are the two annotation capabilities that determine whether retail computer vision systems perform reliably in production.

Key Takeaways

  • Retail computer vision models require annotation that reflects actual in-store conditions, including variable lighting, partial occlusion, cluttered shelves, and diverse viewing angles, not studio or controlled-environment images.
  • Product catalogue change is the defining annotation challenge in retail: new SKUs, packaging redesigns, and seasonal items require continuous retraining cycles that standard annotation workflows are not designed to sustain efficiently.
  • Loss prevention video annotation requires behavioral labeling across long, continuous footage sequences with consistent event tagging, a fundamentally different task from product-level image annotation.
  • Frictionless checkout systems require fine-grained product recognition at close range and from arbitrary angles, with annotation precision requirements significantly higher than shelf-level inventory monitoring.
  • Active learning approaches that concentrate annotation effort on the images the model is most uncertain about can reduce annotation volume while maintaining equivalent model performance, making continuous retraining economically viable.

The Product Recognition Challenge

Why Retail Products Are Harder to Recognize Than They Appear

Product recognition at the SKU level is among the most demanding fine-grained recognition problems in applied computer vision. A single product category may contain hundreds of SKUs with visually similar packaging that differ only in flavor text, weight descriptor, or color accent. The model must distinguish a 500ml bottle of a product from the 750ml version of the same product, or a low-sodium variant from the regular variant, based on packaging details that are easy for a human to read at close range and nearly impossible to distinguish reliably from a shelf-distance camera angle with variable lighting. The visual similarity between related SKUs means that annotation must be both granular, assigning correct SKU-level labels, and consistent, applying the same label to the same product across all its appearances in the training set.

Packaging Variation Within a Single SKU

A single product SKU may appear in multiple packaging variants that are all legitimately the same product: regional packaging editions, promotional packaging, seasonal limited editions, and retailer-exclusive variants may all carry different visual appearances while representing the same product in the inventory system. A model trained only on standard packaging images will misidentify promotional variants, creating phantom out-of-stock detections for products that are present but packaged differently. Annotation programs need to account for packaging variation within SKUs, either by grouping variants under a shared label or by labeling each variant explicitly and mapping variant labels to canonical SKU identifiers in the annotation ontology.

The Continuous Catalogue Change Problem

The most distinctive challenge in retail computer vision annotation is catalogue change. New products are introduced continuously. Existing products are reformulated with new packaging. Seasonal items appear and disappear. Brand refreshes change the visual identity of entire product lines. Each of these changes requires updating the model’s knowledge of the product catalogue, which in a production deployment means retraining on new annotation data. Data collection and curation services that integrate active learning into the annotation workflow make continuous catalogue updates economically sustainable rather than a periodic annotation project that falls behind the rate of catalogue change.

Annotation Requirements for Shelf Monitoring

Bounding Box Annotation for Product Detection

Product detection in shelf images requires bounding box annotations that precisely enclose each product face visible in the image. In dense shelf layouts with products positioned side by side, the boundaries between adjacent products must be annotated accurately: bounding boxes that overlap into adjacent products will teach the model incorrect spatial relationships between products, degrading both detection accuracy and planogram compliance assessment. Annotators working on dense shelf images must make consistent decisions about how to handle partially visible products at the edges of the image, products occluded by price tags or promotional materials, and products where the facing is ambiguous because of the viewing angle.

Planogram Compliance Labeling

Beyond product detection, planogram compliance annotation requires that detected products are labeled with their placement status relative to the reference planogram: correctly placed, incorrectly positioned within the correct shelf row, on the wrong shelf row, or out of stock. This label set requires annotators to have access to the reference planogram for each store format, to understand the compliance rules being enforced, and to apply consistent judgment when product placement is ambiguous. Annotators without adequate training in planogram compliance rules will produce inconsistent compliance labels that teach the model incorrect decision boundaries between compliant and non-compliant placement.

Lighting and Environment Variation in Training Data

Shelf images collected under consistent controlled lighting conditions produce models that fail when deployed in stores with different lighting setups. Fluorescent lighting, natural light from store windows, spotlighting on promotional displays, and low-light conditions in refrigerated sections all create different visual characteristics in the same product packaging. 

Training data needs to cover the range of lighting conditions the deployed system will encounter, which typically requires deliberate data collection from multiple store environments rather than relying on a single-location dataset. AI data preparation services that audit training data for environmental coverage gaps, including lighting variation, viewing angle distribution, and store format diversity, identify the specific collection and annotation investments needed before a model can be reliably deployed across a retail network.

Annotation Requirements for Loss Prevention

Behavioral Event Annotation in Long Video Sequences

Loss prevention annotation is fundamentally a video annotation task, not an image annotation task. Annotators must label behavioral events, including product pickup, product concealment, self-checkout bypass actions, and extended dwell at high-value displays, within continuous video footage that may contain hours of unremarkable background activity for every minute of annotatable event. The annotation challenge is to identify event boundaries precisely, to assign consistent event labels across annotators, and to maintain the temporal context that distinguishes genuine suspicious behavior from normal customer behavior that superficially resembles it.

Video annotation for behavioral applications requires annotation workflows that are specifically designed for temporal consistency: annotators need to label the start and end of each behavioral event, maintain consistent individual tracking identifiers across camera cuts, and apply behavioral category labels that are defined with enough specificity to be applied consistently across annotators. Video annotation services for physical AI describe the temporal consistency requirements that differentiate video annotation quality from frame-level image annotation quality.

Class Imbalance in Loss Prevention Training Data

Loss prevention training datasets face a severe class imbalance problem. Genuine theft events are rare relative to the total volume of customer interactions captured by store cameras. A model trained on data where theft events represent a tiny fraction of total examples will learn to classify almost everything as non-theft, achieving high overall accuracy while being useless as a loss prevention tool. Addressing this imbalance requires deliberate data curation strategies: oversampling of theft events, synthetic augmentation of event footage, and training strategies that weight the minority class appropriately. The annotation program needs to produce a class-balanced dataset through curation rather than assuming that passive data collection from store cameras will produce a usable class distribution.

Privacy Requirements for Loss Prevention Data

Loss prevention computer vision operates on footage of real customers in a physical retail environment. The data governance requirements differ from product recognition: annotators are working with footage that may identify individuals, and the annotation process itself creates a record of individual customer behavior. Retention limits on identifiable footage, anonymization requirements for training data that will be shared across retail locations, and access controls on annotation systems processing this data are all governance requirements that need to be built into the annotation workflow design rather than added as compliance checks after the data has been collected and annotated. Trust and safety solutions applied to retail AI annotation programs include the data governance and anonymization infrastructure that satisfies GDPR and equivalent privacy regulations in the jurisdictions where the retail system is deployed.

Frictionless Checkout: The Highest Precision Bar

Why Checkout Recognition Requires Finer Annotation Than Shelf Monitoring

In shelf monitoring, a product misidentification produces an incorrect inventory record or a missed out-of-stock alert. In frictionless checkout, a product misidentification produces a billing error: a customer is charged for the wrong product, or a product is not charged at all. The business and reputational consequences are qualitatively different, and the annotation precision requirement reflects this. 

Bounding boxes on product images for checkout recognition must be tighter than for shelf monitoring. The product category taxonomy must be more granular, distinguishing SKUs at the level of size, flavor, and variant that affect price. And the training data must include the close-range, hand-occluded, arbitrary-angle views that checkout cameras capture when customers pick up and put down products.

Multi-Angle and Occlusion Coverage

A product picked up by a customer in a frictionless checkout environment will be visible in the camera feed from multiple angles as the customer handles it. The model needs training examples that cover the full range of orientations in which any product can appear at close range: front, back, side, top, bottom, and partially occluded by the customer’s hand at each orientation. Collecting and annotating training data that covers this multi-angle requirement for every product in a large assortment is a substantial annotation investment, but it is the investment that determines whether the system charges customers correctly rather than producing billing disputes that undermine the frictionless experience the system was built to create.

Handling New Products at Checkout

Frictionless checkout systems encounter products the model has never seen before, either because they are new to the assortment or because they are products brought in by customers from other retailers. The system needs a defined behavior for unrecognized products: queuing them for human review, routing them to a fallback manual scan option, or flagging them as unresolved in the transaction. The annotation program needs to include training examples for this unrecognized product handling behavior, not just for the canonical recognized assortment. Human-in-the-loop computer vision for safety-critical systems describes how human review integration into automated vision systems handles the ambiguous cases that model confidence alone cannot reliably resolve.

Managing the Annotation Lifecycle in Retail

The Retraining Cycle and Its Annotation Economics

Unlike many computer vision applications, where the object set is relatively stable, retail computer vision programs operate in an environment of continuous change. A grocery retailer introduces hundreds of new SKUs annually. A fashion retailer’s entire product catalogue changes seasonally. A convenience store network conducts quarterly planogram resets that change the product mix and layout across all locations. Each of these changes creates a gap between what the deployed model knows and what the real-world retail environment looks like, and closing that gap requires annotation of new training data on a timeline that matches the rate of change.

Active Learning as the Structural Solution

Active learning addresses the annotation economics problem by directing annotator effort toward the images that will most improve model performance rather than uniformly annotating every new product image. For catalogue updates, this means annotating the product images where the model’s confidence is lowest, rather than annotating all available images of new products. Data collection and curation services that integrate active learning into the retail annotation workflow make the continuous retraining cycle sustainable at the pace that catalogue change requires.

Annotation Ontology Management Across a Retail Network

Retail annotation programs that operate across a large network of store formats, regional markets, and product ranges face ontology management challenges that single-location programs do not. The product taxonomy needs to be consistent across all annotation work so that a product annotated in one store format is labeled identically to the same product annotated in a different format. 

Label hierarchies need to accommodate both store-level granularity for planogram compliance and network-level granularity for cross-store analytics. And the taxonomy needs to be maintained as a living document that is updated when products are added, removed, or relabeled, with change propagation to the annotation teams working across all active annotation projects.

How Digital Divide Data Can Help

Digital Divide Data provides image and video annotation services designed for the specific requirements of retail computer vision programs, from product recognition and shelf monitoring to loss prevention and checkout behavior, with annotation workflows built around the continuous catalogue change and multi-environment deployment that retail programs demand.

The image annotation services capability covers SKU-level product recognition with bounding box, polygon, and segmentation annotation types, planogram compliance labeling, and multi-angle product coverage across the viewing conditions and packaging variations that retail deployments encounter. Annotation ontology management ensures label consistency across assortments, store formats, and regional markets.

For loss prevention and behavioral analytics programs, video annotation services provide behavioral event labeling in continuous footage, temporal consistency across frames and camera transitions, and anonymization workflows that satisfy the privacy requirements of in-store footage. Class imbalance in loss prevention datasets is addressed through deliberate curation and augmentation strategies rather than accepting the imbalance that passive collection produces.

Active learning integration into the retail annotation workflow is available through data collection and curation services that direct annotation effort toward the catalogue items where model performance gaps are largest, making continuous retraining sustainable at the pace retail catalogue change requires. Model evaluation services close the loop between annotation investment and production model performance, measuring accuracy stratified by product category, lighting condition, and store format to identify where additional annotation coverage is needed.

Build retail computer vision training data that performs across the full range of conditions your stores actually present. Talk to an expert!

Conclusion

The computer vision applications transforming retail, from shelf monitoring and loss prevention to frictionless checkout and customer analytics, share a common dependency: they perform reliably in production only when their training data reflects the actual conditions of the environments they are deployed in. 

The gap between a working demonstration and a reliable deployment is almost always a training-data gap, not a model-architecture gap. Meeting that gap in retail requires annotation programs that cover the full diversity of product appearances, lighting environments, viewing angles, and behavioral scenarios the deployed system will encounter, and that sustain the continuous annotation investment that catalogue change requires.

The annotation investment that makes retail computer vision programs reliable is front-loaded but compounds over time. A model trained on annotation that genuinely covers production conditions requires fewer correction cycles, performs equitably across the store network rather than only in the flagship locations where pilot data was collected, and handles catalogue changes without the systematic accuracy degradation that programs which treat annotation as a one-time exercise consistently experience.

Image annotation and video annotation built to the quality and coverage standards that retail computer vision demands are the foundation that separates programs that scale from those that remain unreliable pilots.

References

Griffioen, N., Rankovic, N., Zamberlan, F., & Punith, M. (2024). Efficient annotation reduction with active learning for computer vision-based retail product recognition. Journal of Computational Social Science, 7(1), 1039-1070. https://doi.org/10.1007/s42001-024-00266-7

Ou, T.-Y., Ponce, A., Lee, C., & Wu, A. (2025). Real-time retail planogram compliance application using computer vision and virtual shelves. Scientific Reports, 15, 43898. https://doi.org/10.1038/s41598-025-27773-5

Grand View Research. (2024). Computer vision AI in retail market size, share and trends analysis report, 2033. Grand View Research. https://www.grandviewresearch.com/industry-analysis/computer-vision-ai-retail-market-report

National Retail Federation & Capital One. (2024). Retail shrink report: Global shrink projections 2024. NRF.

Frequently Asked Questions

Q1. Why do retail computer vision models trained on studio product images perform poorly in stores?

Studio images capture products under ideal controlled conditions that differ from in-store reality in lighting, viewing angle, partial occlusion, and surrounding clutter. Models trained only on studio imagery learn a visual distribution that does not match the production environment, producing systematic errors in the conditions that stores actually present.

Q2. How does product catalogue change affect retail computer vision programs?

New SKU introductions, packaging redesigns, and seasonal items continuously create gaps between what the deployed model recognizes and the current product assortment. Each change requires retraining on new annotated data, making annotation a recurring operational cost rather than a one-time development investment.

Q3. What annotation type does a loss prevention computer vision system require?

Loss prevention requires behavioral event annotation in continuous video footage: labeling the start and end of theft-related behaviors within long sequences that contain predominantly unremarkable background activity, with consistent temporal identifiers maintained across camera transitions.

Q4. How does active learning reduce annotation cost in retail computer vision programs?

Active learning concentrates annotation effort on the product images where the model’s confidence is lowest, rather than uniformly annotating all new product imagery. Research on retail product recognition demonstrates this approach can achieve 95 percent of full-dataset model performance with 20 to 25 percent of the annotation volume.

Retail Computer Vision: What the Models Actually Need to See Read Post »

audio annotation

Audio Annotation for Speech AI: What Production Models Actually Need

Audio annotation for speech AI covers a wider territory than most programs initially plan for. Transcription is the obvious starting point, but production speech systems increasingly need annotation that goes well beyond faithful word-for-word text. 

Speaker diarization, emotion and sentiment labeling, phonetic and prosodic marking, intent and entity annotation, and quality metadata such as background noise levels and speaker characteristics are all annotation types that determine what a speech AI system can and cannot do in deployment. Programs that treat audio annotation as a transcription task and add the other dimensions later, under pressure from production failures, pay a higher cost than those that design the full annotation requirement from the start.

This blog examines what production speech AI models actually need from audio annotation, covering the full range of annotation types, the quality standards each requires, the specific challenges of accent and language diversity, and how annotation design connects to model performance at deployment. Audio annotation and low-resource language services are the two capabilities where speech model quality is most directly shaped by annotation investment.

Key Takeaways

  • Transcription alone is insufficient for most production speech AI use cases; speaker diarization, emotion labeling, intent annotation, and quality metadata are each distinct annotation types with their own precision requirements.
  • Annotation team demographic and linguistic diversity directly determines whether speech models perform equitably across the full user population; models trained predominantly on data from narrow speaker demographics systematically underperform for others.
  • Paralinguistic annotation, covering emotion, stress, prosody, and speaking style, requires human annotators with specific expertise and structured inter-annotator agreement measurement, as these dimensions involve genuine subjectivity.
  • Low-resource languages face an acute annotation data gap that compounds at every level of the speech AI pipeline, from transcription through diarization to emotion recognition.

The Gap Between Benchmark Accuracy and Production Performance

Domain-Specific Vocabulary and Model Failure Modes

Domain-specific terminology is one of the most consistent sources of ASR failure in production deployments. A general-purpose speech model that handles everyday conversation well may produce high error rates on medical terms, legal language, financial product names, technical abbreviations, or industry-specific acronyms that appear infrequently in general-purpose training data. 

Each of these failure modes requires targeted annotation investment: transcription data drawn from or simulating the target domain, with domain vocabulary represented at the density at which it will appear in production. Data collection and curation services designed for domain-specific speech applications source and annotate audio from the relevant domain context rather than relying on general-purpose corpora that systematically under-represent the vocabulary the deployed model needs to handle.

Transcription Annotation: The Foundation and Its Constraints

What High-Quality Transcription Actually Requires

Transcription annotation converts spoken audio into written text, providing the core training signal for automatic speech recognition. The quality requirements for production-grade transcription go well beyond phonetic accuracy. Transcripts need to capture disfluencies, self-corrections, filled pauses, and overlapping speech in a way that is consistent across annotators. 

They need to handle domain-specific vocabulary and proper nouns correctly. They need to apply a consistent normalization convention for numbers, dates, abbreviations, and punctuation. And they need to distinguish between what was actually said and what the annotator assumes was meant, a distinction that becomes consequential when speakers produce grammatically non-standard or heavily accented speech.

Verbatim transcription, which captures what was actually said, including disfluencies, and clean transcription, which normalizes speech to standard written form, produce different training signals and are appropriate for different applications. Speech recognition systems trained on verbatim transcripts are better equipped to handle naturalistic speech. Systems trained on clean transcripts may perform better on formal speech contexts but underperform on conversational audio. The choice is a design decision with downstream model behavior implications, not an annotation default.

Timestamps and Alignment

Word-level and segment-level timestamps, which record when each word or phrase begins and ends in the audio, are required for applications including meeting transcription, subtitle generation, speaker diarization training, and any downstream task that needs to align text with audio at fine time resolution. Forced alignment, which uses an ASR model to assign timestamps to a given transcript, can automate this process for clean audio. 

For noisy audio, overlapping speech, or audio where the automatic alignment is unreliable, human annotators must produce or verify timestamps manually. Building generative AI datasets with human-in-the-loop workflows is directly applicable here: the combination of automated pre-annotation with targeted human review and correction of alignment errors is the most efficient approach for timestamp annotation at scale.

Speaker Diarization: Who Said What and When

Why Diarization Is a Distinct Annotation Task

Speaker diarization assigns segments of an audio recording to specific speakers, answering the question of who is speaking at each moment. It is a prerequisite for any speech AI application that needs to attribute statements to individuals: meeting summarization, customer service call analysis, clinical conversation annotation, legal transcription, and multi-party dialogue systems all depend on accurate diarization. The annotation task requires annotators to identify speaker change points, handle overlapping speech where multiple speakers talk simultaneously, and maintain consistent speaker identities across a recording, even when a speaker is silent for extended periods and then resumes.

Diarization annotation difficulty scales with the number of speakers, the frequency of turn-taking, the amount of overlapping speech, and the acoustic similarity of speaker voices. In a two-speaker interview with clean audio and infrequent interruption, automated diarization performs well, and human annotation mainly serves as a quality check. In a multi-party meeting with frequent interruptions, background noise, and acoustically similar speakers, human annotation remains the only reliable method for producing accurate speaker attribution.

Diarization Annotation Quality Standards

Diarization error rate, which measures the proportion of audio incorrectly attributed to the wrong speaker, is the standard quality metric for diarization annotation. The acceptable threshold depends on the application: a meeting summarization tool may tolerate higher diarization error than a legal transcription service where speaker attribution has evidentiary consequences. 

Annotation guidelines for diarization need to specify how to handle overlapping speech, what to do when speaker identity is ambiguous, and how to manage the consistent speaker label assignment across long recordings with interruptions and re-entries. Healthcare AI solutions that depend on accurate clinical conversation annotation, including distinguishing clinician speech from patient speech, require diarization annotation standards calibrated to the clinical documentation context rather than general meeting transcription.

Emotion and Sentiment Annotation: The Subjectivity Challenge

Why Emotional Annotation Requires Structured Human Judgment

Emotion recognition from speech requires training data where audio segments are labeled with the emotional state of the speaker: anger, frustration, satisfaction, sadness, excitement, or more fine-grained states, depending on the application. The annotation challenge is that emotion is inherently subjective and that different annotators will categorize the same audio segment differently, not because one is wrong but because the perception of emotional expression carries genuine ambiguity. A speaker who sounds mildly frustrated to one annotator may sound neutral or slightly impatient to another. This inter-annotator disagreement is not noise to be eliminated through adjudication; it is information about the inherent uncertainty of the annotation task.

Annotation guidelines for emotion recognition need to define the emotion taxonomy clearly, provide worked examples for each category, including boundary cases, and specify how disagreement should be handled. Some programs use majority-vote labels where the most common annotation across a panel becomes the ground truth. Others preserve the full distribution of annotator labels and use soft labels in training. Each approach encodes a different assumption about how emotional perception works, and the choice has implications for how the trained model handles ambiguous audio at inference time.

Dimensional vs. Categorical Emotion Annotation

Emotion annotation can be categorical, assigning audio segments to discrete emotion classes, or dimensional, rating audio on continuous scales such as valence from negative to positive and arousal from low to high energy. Categorical annotation is more intuitive for annotators and more straightforwardly usable in classification training, but it forces a discrete boundary where the underlying phenomenon is continuous. Dimensional annotation captures the continuous nature of emotional expression more accurately, but is harder to produce reliably and harder to use directly in classification tasks. The choice between approaches should be made based on the downstream model requirements, not on which is easier to annotate.

Sentiment vs. Emotion: Different Tasks, Different Signals

Sentiment annotation, which labels speech as positive, negative, or neutral in overall orientation, is related to but distinct from emotion annotation. Sentiment is easier to annotate consistently because the three-way distinction is less ambiguous than multi-class emotion categories. For applications like customer service quality monitoring, where the business question is whether a customer is satisfied or dissatisfied, sentiment annotation is the appropriate task. 

For applications that need to distinguish between specific emotional states, such as detecting customer frustration versus customer confusion to route to different intervention types, emotion annotation is required. Human preference optimization data collection for speech-capable AI systems needs to capture sentiment dimensions alongside response quality dimensions, as the emotional valence of a model’s response is as important as its factual accuracy in conversational contexts.

Paralinguistic Annotation: Beyond the Words

What Paralinguistic Features Are and Why They Matter

Paralinguistic features are properties of speech that carry meaning independently of the words spoken: speaking rate, pitch variation, voice quality, stress patterns, pausing behavior, and non-verbal vocalizations such as laughter, sighs, and hesitation sounds. These features convey emphasis, uncertainty, emotional state, and pragmatic intent in ways that transcription cannot capture. A speech AI system trained only on transcription data will be blind to these dimensions, producing models that cannot reliably identify when a speaker is being sarcastic, emphasizing a particular point, or signaling uncertainty through vocal hesitation.

Paralinguistic annotation is technically demanding because the features it captures are not visible in the audio waveform without domain expertise. Annotators need either acoustic training or sufficient familiarity with the target language and speaker population to reliably identify paralinguistic cues. Inter-annotator agreement on paralinguistic labels is typically lower than for transcription or sentiment, which means that the quality assurance process needs to specifically measure agreement on paralinguistic dimensions and investigate disagreements rather than treating them as simple annotation errors.

Non-Verbal Vocalizations

Non-verbal vocalizations, including laughter, crying, coughing, breathing artifacts, and filled pauses such as hesitation sounds, are annotation categories that matter for building conversational AI systems that can respond appropriately to human speech in its full natural form. Standard transcription conventions either ignore these vocalizations or represent them inconsistently. Speech models trained on data where non-verbal vocalizations are absent or inconsistently labeled will produce models that mishandle the segments of audio they appear in. The low-resource languages in the AI context compound this problem: the non-verbal vocalization conventions that are common in one language or culture may differ significantly from another, and annotation guidelines developed for one language community do not transfer without adaptation.

Intent and Entity Annotation for Conversational AI

From Transcription to Understanding

Spoken language understanding, the task of extracting meaning from transcribed speech, requires annotation beyond transcription. Intent annotation identifies the goal of an utterance: is the speaker requesting information, issuing a command, expressing a complaint, or performing some other speech act? 

Entity annotation identifies the specific items the utterance refers to: the dates, names, products, locations, and domain-specific terms that carry the semantic content of the request. Together, intent and entity annotation provide the training signal for the dialogue systems, voice assistants, and customer service automation tools that form the large commercial segment of speech AI.

Intent and entity annotation is a natural language understanding task applied to transcribed speech, with the additional complication that the transcription may contain errors, disfluencies, and incomplete sentences that make the annotation task harder than it would be for clean written text. Annotation guidelines need to specify how to handle transcription errors when they affect intent or entity identification, and whether to annotate based on what was said or what was clearly meant.

Custom Taxonomies for Domain-Specific Applications

Domain-specific conversational AI systems require intent and entity taxonomies tailored to the application context. A healthcare voice assistant needs intent categories and entity types specific to clinical workflows. A financial services voice system needs entity types that capture financial products, account actions, and regulatory classifications. 

Applying a generic intent taxonomy to a domain-specific application produces models that classify correctly within the generic categories while missing the distinctions that matter for the specific deployment context. Text annotation expertise in domain-specific semantic labeling transfers directly to spoken language understanding annotation, as the linguistic analysis required is equivalent once the transcription layer has been handled.

Speaker Diversity and the Representation Problem

How Annotation Demographics Shape Model Performance

Speech AI models learn from the audio they are trained on, and their performance reflects the speaker population that population represents. A model trained predominantly on audio from native English speakers in North American accents will perform well for that population and systematically worse for speakers with different accents, different dialects, or different native language backgrounds. This is not a modelling limitation that can be overcome with a better architecture. It is a training data problem that can only be addressed by ensuring that the annotation corpus represents the speaker population the model will serve.

The bias compounds across annotation stages. If the transcription annotators predominantly speak one dialect, their transcription conventions will encode that dialect’s phonological expectations. If the emotion annotators come from a narrow demographic background, their emotion labels will reflect that background’s emotional expression norms. Annotation team composition is a data quality variable with direct model performance implications, not a separate diversity consideration.

Accent and Dialect Coverage

Accent and dialect coverage in audio annotation corpora requires intentional design rather than emergent diversity from large-scale data collection. A large corpus of English audio collected from widely available sources will over-represent certain regional varieties and under-represent others, producing models that perform inequitably across the English-speaking world. 

Designing accent coverage into the data collection protocol, recruiting speakers from targeted geographic and demographic backgrounds, and annotating accent and dialect metadata explicitly are all practices that produce more equitable model performance. Low-resource language services address the most acute version of this problem, where entire language communities are absent from or severely underrepresented in standard speech AI training corpora.

Children’s Speech and Elderly Speech

Speech models trained predominantly on adult speech from a narrow age range perform systematically worse on children’s speech and elderly speech, both of which have acoustic characteristics that differ from typical adult speech in ways that standard training corpora do not cover adequately. 

Children speak with higher fundamental frequencies, less consistent articulation, and age-specific vocabulary. Elderly speakers may exhibit slower speaking rates, increased disfluency, and voice quality changes associated with aging. Applications targeting these populations, including educational technology for children and assistive technology for older adults, require annotation corpora that specifically cover the acoustic characteristics of the target age group.

Audio Quality Metadata: The Often Overlooked Annotation Layer

Why Quality Metadata Improves Model Robustness

Audio annotation programs that capture metadata about recording conditions alongside the primary annotation labels produce training datasets with information that enables more sophisticated model training strategies. Signal-to-noise ratio estimates, background noise type labels, recording environment classifications, and microphone quality indicators allow training pipelines to weight examples differently, sample more heavily from underrepresented acoustic conditions, and train models that are more explicitly robust to the acoustic degradation patterns they will encounter in production.

Trust and safety evaluation for speech AI applications also benefits from quality metadata annotation. Models deployed in conditions where audio quality is consistently poor may produce transcriptions with higher error rates in ways that interact with content safety filtering, producing either false positives or false negatives in safety classification that a quality-aware model could avoid. Recording quality metadata provides the context that allows safety-aware speech models to calibrate their confidence appropriately to the audio conditions they are operating in.

Recording Environment and Background Noise Classification

Background noise classification, which labels audio segments by the type and level of environmental interference, produces a training signal that helps models learn to be robust to specific noise categories. A customer service speech model that is trained on audio labeled by noise type, including telephone channel noise, call center background chatter, and mobile network artifacts, learns representations that are more specific to the noise conditions it will encounter than a model trained on undifferentiated noisy audio. This specificity pays dividends in production, where the model is more likely to encounter the specific noise patterns it was trained to be robust to.

How Digital Divide Data Can Help

Digital Divide Data provides audio annotation services across the full range of annotation types that production speech AI programs require, from transcription through diarization, emotion and sentiment labeling, paralinguistic annotation, intent and entity extraction, and audio quality metadata.

The audio annotation capability covers verbatim and clean transcription with domain-specific vocabulary handling, word-level and segment-level timestamp alignment, speaker diarization including overlapping speech annotation, and non-verbal vocalization labeling. Annotation guidelines are developed for each project context, not applied from a generic template, ensuring that the annotation reflects the specific acoustic conditions and vocabulary distribution of the target deployment.

For speaker diversity requirements, data collection and curation services source audio from speaker populations that match the intended deployment demographics, with explicit accent, dialect, age, and gender coverage targets built into the collection protocol. Annotation team composition is managed to match the speaker diversity requirements of the corpus, ensuring that transcription conventions and emotion labels reflect the linguistic and cultural norms of the target population.

For programs requiring paralinguistic annotation, emotion labeling, or sentiment classification, structured annotation workflows include inter-annotator agreement measurement on subjective dimensions, disagreement analysis, and guideline refinement cycles that converge on the annotation consistency that model training requires. Model evaluation services provide independent evaluation of trained speech models against production-representative audio, linking annotation quality investment to deployed model performance.

Build speech AI training data that closes the gap between benchmark performance and production reliability. Talk to an expert!

Conclusion

The gap between speech AI benchmark performance and production reliability is primarily an annotation problem. Models that excel on clean, curated test sets fail in production when the training data did not cover the acoustic conditions, speaker demographics, vocabulary distributions, and non-transcription annotation dimensions that the deployed system actually encounters. Closing that gap requires audio annotation programs that go well beyond transcription to cover the full range of signal dimensions that speech AI systems need to interpret: speaker identity, emotional state, paralinguistic cues, intent, entity content, and the acoustic quality metadata that allows models to calibrate their behavior to the conditions they are operating in.

The investment in comprehensive audio annotation is front-loaded, but the returns compound throughout the model lifecycle. A speech model trained on annotations that cover the full production distribution requires fewer retraining cycles, performs more equitably across the user population, and handles production edge cases without the systematic failure modes that narrow annotation programs produce. Audio annotation designed around the specific requirements of the deployment context, rather than the convenience of the annotation process, is the foundation of reliable production speech AI.

References

Kuhn, K., Kersken, V., Reuter, B., Egger, N., & Zimmermann, G. (2024). Measuring the accuracy of automatic speech recognition solutions. ACM Transactions on Accessible Computing, 17(1), 25. https://doi.org/10.1145/3636513

Park, T. J., Kanda, N., Dimitriadis, D., Han, K. J., Watanabe, S., & Narayanan, S. (2022). A review of speaker diarization: Recent advances with deep learning. Computer Speech and Language, 72, 101317. https://doi.org/10.1016/j.csl.2021.101317

Frequently Asked Questions

Q1. Why does speech AI performance drop significantly between benchmarks and production?

Standard benchmarks use clean, professionally recorded audio from narrow speaker demographics, while production audio includes background noise, diverse accents, domain-specific vocabulary, and naturalistic speech conditions that models have not been trained to handle if the annotation corpus did not cover them.

Q2. What annotation types are needed beyond transcription for production speech AI?

Production speech AI typically requires speaker diarization for multi-speaker attribution, emotion and sentiment labeling for conversational context, paralinguistic annotation for prosody and non-verbal cues, intent and entity annotation for spoken language understanding, and audio quality metadata for noise robustness training.

Q3. How does annotation team diversity affect speech model performance?

Annotation team demographics influence transcription conventions, emotion label distributions, and implicit quality standards in ways that encode the team’s linguistic and cultural norms into the training data, producing models that perform more reliably for speaker populations that resemble the annotation team.

Q4. What is the difference between verbatim and clean transcription, and when should each be used?

Verbatim transcription captures speech exactly as produced, including disfluencies, self-corrections, and filled pauses, producing models better suited to naturalistic conversation. Clean transcription normalizes speech to standard written form, producing models better suited to formal speech contexts but less robust to conversational input.

Audio Annotation for Speech AI: What Production Models Actually Need Read Post »

Human-in-the-Loop

When to Use Human-in-the-Loop vs. Full Automation for Gen AI

The framing of human-in-the-loop versus full automation is itself slightly misleading, because the decision is rarely binary. Most production GenAI systems operate on a spectrum, applying automated processing to high-confidence, low-risk outputs and routing uncertain, high-stakes, or policy-sensitive outputs to human review. The design question is where on that spectrum each output category belongs, which thresholds trigger human review, and what the human reviewer is actually empowered to do when they enter the loop.

This blog examines how to make that decision systematically for generative AI programs, covering the dimensions that distinguish tasks suited to automation from those requiring human judgment, and how human involvement applies differently across the GenAI development lifecycle versus the inference pipeline. Human preference optimization and trust and safety solutions are the two GenAI capabilities where human oversight most directly determines whether a deployed system is trustworthy.

Key Takeaways

  • Human-in-the-loop (HITL) and full automation are not binary opposites; most production GenAI systems use a spectrum based on output risk, confidence, and regulatory context.
  • HITL is essential at three lifecycle stages: preference data collection for RLHF, model evaluation for subjective quality dimensions, and safety boundary review at inference.
  • Confidence-based routing, directing low-confidence outputs to human review, only works if the model’s stated confidence is empirically validated to correlate with its actual accuracy.
  • Active learning concentrates human annotation effort on the outputs that most improve model performance, making HITL economically viable at scale.

The Fundamental Decision Framework

Four Questions That Determine Where Humans Belong

Before assigning any GenAI task to full automation or to an HITL workflow, four questions need to be answered. 

First: what is the cost of a wrong output? If errors are low-stakes, easily correctable, and reversible, the calculus favors automation. If errors are consequential, hard to detect downstream, or irreversible, the calculus favors human review. 

Second: how well-defined is correctness for this task? Tasks with verifiable correct answers, like code that either passes tests or does not, can be automated more reliably than tasks where quality requires contextual judgment.

Third: how consistent is the model’s performance across the full distribution of inputs the task will produce? A model that performs well on average but fails unpredictably on specific input types needs human oversight targeted at those types, not uniform automation across the board. 

Fourth: Does a regulatory or compliance framework impose human accountability requirements for this decision type? In regulated domains, the answer to this question can override the purely technical assessment of whether automation is capable enough.

The Spectrum Between Full Automation and Full Human Review

Most production systems implement neither extreme. Each point on this spectrum makes a different trade-off between throughput, cost, consistency, and the risk of undetected errors. The right point differs by task category, even within a single deployment. Treating the decision as binary and applying the same oversight level to every output type wastes reviewer capacity on low-risk outputs while under-protecting high-risk ones.

Distinguishing Human-in-the-Loop from Human-on-the-Loop

In a HITL design, the human actively participates in processing: reviewing, correcting, or approving outputs before they are acted on. In a human-on-the-loop design, automated processing runs continuously, and humans set policies and intervene when aggregate metrics signal a problem. Human-on-the-loop is appropriate for lower-stakes automation where real-time individual review is impractical. Human-in-the-loop is appropriate where individual output quality matters enough to justify the latency and cost of per-item review. Agentic AI systems that take real-world actions, covered in depth in building trustworthy agentic AI with human oversight, require careful consideration of which action categories trigger each pattern.

Human Involvement Across the GenAI Development Lifecycle

Data Collection and Annotation

In the data development phase, humans collect, curate, and annotate the examples that teach the model what good behavior looks like. Automation can assist at each stage, but for subjective quality dimensions, the human signal sets the ceiling of what the model can learn. Building generative AI datasets with human-in-the-loop workflows covers how annotation workflows direct human effort to the examples that most improve model quality rather than applying uniform review across the full corpus.

Preference Data and Alignment

Reinforcement learning from human feedback is the primary mechanism for aligning generative models with quality, safety, and helpfulness standards. The quality of this preference data depends critically on the representativeness of the annotator population, the specificity of evaluation criteria, and the consistency of annotation guidelines across reviewers. Poor preference data produces aligned-seeming models that optimize for superficial quality signals rather than genuine quality. Human preference optimization at the required quality level is itself a discipline requiring structured workflows, calibrated annotators, and systematic inter-annotator agreement measurement.

Human Judgment as the Evaluation Standard

Automated metrics capture some quality dimensions and miss others. For output dimensions that require contextual judgment, human evaluation is the primary signal. Model evaluation services for production GenAI programs combine automated metrics for the dimensions they can measure reliably with structured human evaluation for the dimensions they cannot, producing an evaluation framework that actually predicts production performance.

Criteria for Choosing Automation in the Inference Pipeline

When Automation Is the Right Default

Common GenAI tasks suited to automation include content classification, where model confidence is high, structured data extraction from documents with a well-defined schema, code completion suggestions where tests verify correctness, and first-pass moderation of clearly violating content where the violation is unambiguous. These tasks share the property that outputs are either verifiably correct or easily triaged by downstream processes.

Confidence Thresholds as the Routing Mechanism

The threshold calibration determines the economics of the system: too high and the review queue contains many outputs that would have been correct, wasting reviewer capacity; too low and errors pass through at a rate that undermines the purpose of automation. A miscalibrated model that confidently produces incorrect outputs, while routing correct outputs to human review as uncertain, is worse than either full automation or full human review. Calibration validation is a prerequisite for deploying confidence-based routing in any context where error consequences are significant.

Criteria for Requiring Human Oversight in the Inference Pipeline

High-Stakes, Irreversible, or Legally Consequential Outputs

Medical triage that directs patient care, legal documents filed on behalf of clients, loan decisions that affect credit history, and communications sent to vulnerable users under stress are all outputs where the cost of model error in specific cases exceeds the efficiency benefit of automating those cases. The model’s average accuracy across the distribution does not determine the acceptability of errors in the highest-stakes subset.

Ambiguous, Novel, or Out-of-Distribution Inputs

A well-designed inference pipeline identifies signals of novelty or ambiguity, low model confidence, unusual input structure, topic categories underrepresented in training, or user signals of sensitive context, and routes those inputs to human review. Trust and safety solutions that monitor the output stream for these signals continuously route potentially harmful or policy-violating outputs to human review before they are served.

Safety, Policy, and Ethical Judgment Calls

A model that has learned patterns for identifying policy violations will exhibit systematic blind spots at the policy boundary, and those blind spots are exactly where human judgment is most needed. Automating the obvious cases while routing boundary cases to human review is not a limitation of the automation. It is the correct architecture for any deployment where policy enforcement has real consequences.

Changing the Economics of Human Annotation

Why Uniform Human Review Is Inefficient

In a system where every output is reviewed by a human, the cost of human oversight scales linearly with volume. Most reviews confirm what was already reliable, diluting the human signal with cases that need no correction and burying it in reviewer fatigue. The improvements to model performance come from the small fraction of uncertain or ambiguous outputs that most annotation programs review at the same rate as everything else.

Active Learning as the Solution

For preference data collection in RLHF, active learning selects the comparison pairs where the model’s behavior is most uncertain or most in conflict with human preferences, focusing annotator effort on the feedback that will most change model behavior. The result is a faster model improvement per annotation hour than uniform sampling produces. Data collection and curation services that integrate active learning into annotation workflow design deliver better model improvement per annotation dollar than uniform-sampling approaches.

The Feedback Loop Between Deployment and Training

This flywheel only operates if the human review workflow is designed to capture corrections in a format usable for training, and if the pipeline connects production corrections back to the training data process. Systems that treat human review as a separate customer service function, disconnected from the engineering organization, rarely close this loop and miss the model improvement opportunity that deployment-time human feedback provides.

How Digital Divide Data Can Help

Digital Divide Data provides human-in-the-loop services across the GenAI development lifecycle and the inference pipeline, with workflows designed to direct human effort to the tasks and output categories where it produces the greatest improvement in model quality and safety.

For development-phase human oversight, human preference optimization services provide structured preference annotation with calibrated reviewers, explicit inter-annotator agreement measurement, and protocols designed to produce the consistent preference signal that RLHF and DPO training requires. Active learning integration concentrates reviewer effort on the comparison pairs that most inform model behavior.

For deployment-phase oversight, trust and safety solutions provide output monitoring, safety boundary routing, and human review workflows that keep GenAI systems aligned with policy and regulatory requirements as output volume scales. Review interfaces are designed to minimize automation bias and support substantive reviewer judgment rather than nominal confirmation.

For programs navigating regulatory requirements, model evaluation services provide the independent human evaluation of model outputs that regulators require as evidence of meaningful oversight, documented with the audit trails that compliance frameworks mandate. Generative AI solutions across the full lifecycle are structured around the principle that human oversight is most valuable when systematically targeted rather than uniformly applied.

Design human-in-the-loop workflows that actually improve model quality where it matters. Talk to an expert.

Conclusion

The choice between human-in-the-loop and full automation for a GenAI system is not a one-time architectural decision. It is an ongoing calibration that should shift as model performance improves, as the production input distribution evolves, and as the program’s understanding of where the model fails becomes more precise. The programs that get this calibration right treat HITL design as a discipline, with explicit criteria for routing decisions, measured assessment of where human judgment adds value versus where it adds only variability, and active feedback loops that connect production corrections back to training data pipelines.

As GenAI systems take on more consequential tasks and as regulators impose more specific oversight requirements, the quality of HITL design becomes a direct determinant of whether programs can scale responsibly. A system where human oversight is nominal, where reviewers are overwhelmed, and corrections are inconsistent, provides neither the safety benefits that justify its cost nor the regulatory compliance it is designed to demonstrate. 

Investing in the workflow design, reviewer calibration, and active learning infrastructure that makes human oversight substantive is what separates programs that scale safely from those that scale their error rates alongside their output volume.

References

European Parliament and the Council of the European Union. (2024). Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (AI Act). Official Journal of the European Union. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689

National Institute of Standards and Technology. (2023). AI Risk Management Framework (AI RMF 1.0). NIST. https://doi.org/10.6028/NIST.AI.100-1

Frequently Asked Questions

Q1. What is the difference between human-in-the-loop and human-on-the-loop AI?

Human-in-the-loop places a human as a checkpoint within the pipeline, reviewing or approving individual outputs before they are used. Human-on-the-loop runs automation continuously while humans monitor aggregate system behavior and intervene at the policy level rather than on individual outputs.

Q2. How do you decide which outputs to route to human review in a high-volume GenAI system?

The most practical mechanism is confidence-based routing — directing outputs below a calibrated threshold to human review — but this requires empirical validation that the model’s stated confidence actually correlates with its accuracy before it is used as a routing signal.

Q3. What is automation bias, and why does it undermine human-in-the-loop oversight?

Automation bias is the tendency for reviewers to defer to automated outputs without meaningful assessment, particularly under high volume and time pressure, resulting in nominal oversight where the errors HITL was designed to catch pass through undetected.

Q4. Does active learning reduce the cost of human-in-the-loop annotation for GenAI?

Yes. By identifying which examples would be most informative to annotate, active learning concentrates human effort on the outputs that most improve model performance, producing faster capability gains per annotation hour than uniform sampling of the output stream.

When to Use Human-in-the-Loop vs. Full Automation for Gen AI Read Post »

Model Evaluation for GenAI

Model Evaluation for GenAI: Why Benchmarks Alone Are Not Enough

  1. The gap between benchmark performance and production performance is well understood among practitioners, but it rarely changes how programs approach evaluation in practice. Teams select models based on leaderboard positions, set deployment thresholds based on accuracy scores from public datasets, and, in production, discover that the dimensions that mattered were never measured. 

Benchmark saturation, training data contamination, and the structural limitations of static multiple-choice tests combine to make public benchmarks poor predictors of production behavior for any task that departs meaningfully from the benchmark’s design.

This blog examines why GenAI model evaluation requires a framework that extends well beyond standard benchmarks, covering how benchmark contamination and saturation distort performance signals and what a well-designed evaluation program for a production GenAI system actually looks like. Model evaluation services and human preference optimization are the two evaluation capabilities that production programs most consistently underinvest in relative to the return they deliver.

Why Public Benchmarks are an Unreliable Signal

The Saturation Problem

Many of the most widely cited benchmarks in language model evaluation have saturated. A benchmark saturates when leading models reach near-ceiling scores, at which point the benchmark no longer distinguishes between models of genuinely different capability. Tests that were challenging when first published have been solved or near-solved by frontier models within two to three years of release, rendering them useless for comparative evaluation at the top of the performance distribution.

Saturation is not only a problem for frontier model comparisons. It affects enterprise model selection whenever a team uses a benchmark that was already saturated at the time they ran their evaluation. A model that scores 95% on a saturated benchmark may be no better suited to the production task than a model that scores 88%, and the 7-point gap in the leaderboard number conveys a false sense of differentiation.

The Contamination Problem

Benchmark contamination, where test questions from public evaluation datasets appear in a model’s pre-training corpus, is a pervasive and difficult-to-quantify problem. When a model has seen test set questions during training, its benchmark score reflects memorization rather than generalization. 

The higher the score, the more ambiguous the interpretation: a near-perfect score on a widely published benchmark may indicate genuine capability or extensive training-time exposure to the test set, and there is frequently no reliable way to distinguish between the two from the outside. Detecting and quantifying contamination requires access to training data provenance information that model providers rarely disclose fully.

The practical consequence for teams selecting or evaluating models is that public benchmark scores should be treated as lower-bound estimates of the uncertainty in model capability assessment, not as reliable performance guarantees. This does not mean ignoring benchmarks. It means treating them as one signal among several, weighted by how recently the benchmark was published, how closely its task structure resembles the production task, and how plausible it is that the benchmark data appeared in training.

The Task Structure Mismatch

Most public benchmarks are structured as multiple-choice or short-answer tasks with verifiable correct answers. Most production GenAI tasks are open-ended generation tasks with no single correct answer. The evaluation methods that produce reliable scores on multiple-choice tasks, accuracy against a reference answer key, do not apply to open-ended generation. 

A model that performs well on a multiple-choice reasoning benchmark has demonstrated one capability. Whether it can produce high-quality, contextually appropriate, factually grounded, and tonally suitable open-ended responses to production inputs is a different question that the benchmark does not address.

What Benchmarks Miss: The Dimensions That Determine Production Quality

Behavioral Consistency

A production GenAI system is not evaluated once against a fixed test set. It is evaluated continuously by users who ask the same question in different ways, with different phrasing, different context, and different surrounding conversations. Behavioral consistency, the property that semantically equivalent inputs produce semantically equivalent outputs, is a quality dimension that static benchmarks do not test.

A model that gives contradictory answers to equivalent questions rephrased differently is producing a reliability problem that accuracy on a benchmark will not reveal. Evaluating behavioral consistency requires generating semantically equivalent input variants and measuring output stability, a methodology that requires custom evaluation data collection rather than benchmark lookup.

Calibration and Uncertainty

A well-calibrated model is one whose expressed confidence correlates with its actual accuracy: when it says it is confident, it is usually correct, and when it hedges, it is usually less certain. Calibration is not measured by most public benchmarks. It is an important property for any production system where users make decisions based on model outputs, because an overconfident model that produces plausible-sounding incorrect answers with the same tone and phrasing as correct ones creates a higher risk of harm than a model that signals its uncertainty appropriately.

Robustness to Adversarial and Edge Case Inputs

Benchmarks are designed to be answerable. They contain well-formed, unambiguous questions drawn from the distribution that the benchmark designers anticipated. Production inputs include badly formed queries, ambiguous requests, adversarial attempts to elicit unsafe behavior, and edge cases that fall outside the distribution the model was trained on. Evaluating robustness to these inputs requires test data that was specifically constructed to probe failure modes, not standard benchmark items that were selected because they represent the normal distribution.

Domain-Specific Accuracy in Context

General-purpose benchmarks measure general-purpose capabilities. A healthcare AI system that scores well on general language understanding benchmarks may still produce clinically inaccurate content when deployed in a medical context. A legal AI that excels on reasoning benchmarks may misapply specific statutes. 

Domain accuracy in the deployment context is a distinct evaluation requirement from general benchmark performance, and measuring it requires task-specific evaluation datasets developed with domain expert involvement. Text annotation for domain-specific evaluation data is one of the more consequential investments a deployment program can make, because the domain evaluation set is what will tell the team whether the system is actually reliable in the context it will be used.

Human Evaluation in Model Evaluation for GenAI

Why Automated Metrics Cannot Replace Human Judgment for Generative Tasks

Automated metrics like BLEU, ROUGE, and BERTScore measure overlap between generated text and reference outputs. They are useful for tasks where a reference output exists, and quality can be operationalized as closeness to that reference. For open-ended generation tasks, including summarization, question answering, creative writing, and conversational assistance, there is often no single reference output, and quality has dimensions that overlap metrics cannot capture: helpfulness, appropriate tone, factual accuracy, contextual relevance, and safety.

Human evaluation fills this gap. It captures the dimensions of output quality that automated metrics miss, and it reflects the actual user experience in a way that reference-based metrics cannot. The cost of human evaluation is real, but so is the cost of deploying a model whose quality on the dimensions that matter was never measured.

What Human Evaluation Should Measure

A well-designed human evaluation for a production GenAI system measures multiple output dimensions independently rather than asking evaluators to produce a single overall quality score. Factual accuracy, assessed by evaluators with domain expertise. Helpfulness, assessed by evaluators representing the target user population. Tone appropriateness is assessed against the system’s stated behavioral guidelines. Safety, assessed against a comprehensive set of harm categories relevant to the deployment context. 

Collecting these signals systematically and at scale requires an annotation infrastructure that treats human evaluation as a first-class engineering discipline, not an ad hoc review process. Building GenAI datasets with human-in-the-loop workflows covers the methodological foundations for this kind of systematic human signal collection.

The LLM-as-Judge Approach and Its Limits

Using a language model as an automated evaluator, the LLM-as-judge approach is increasingly common as a way to scale evaluation beyond what human annotation capacity allows. It captures some dimensions of quality better than reference-based metrics and can process large evaluation sets quickly. The method has documented limitations that teams should understand before relying on it as the primary evaluation signal.

LLMs used as judges exhibit systematic biases: preference for longer responses, preference for outputs from architecturally similar models, sensitivity to framing and ordering of the options presented. For safety-critical evaluation, these biases matter. A system evaluated primarily by LLM judges that were themselves trained on similar data may be systematically blind to the failure modes most likely to produce unsafe or incorrect behavior in deployment. Human evaluation remains essential for validating the reliability of LLM judge behavior and for any dimension where systematic bias in the judge would have consequential downstream effects.

Task-Specific and Deployment-Specific Evaluation

Building Evaluation Sets That Reflect the Production Task

The most reliable predictor of production performance is evaluation against a dataset that closely reflects the actual production input distribution. This means drawing evaluation inputs from real user queries where available, constructing synthetic inputs that cover the realistic variation range of the production task, and including explicit coverage of the edge cases and unusual inputs that the production workload contains. 

A program that builds its evaluation set from the production data distribution, rather than from public benchmark datasets, will have a much more accurate picture of whether its model is ready for deployment. Data collection and curation services that sample from or synthesize production-representative inputs are a direct investment in evaluation accuracy.

Red-Teaming as a Systematic Evaluation Method

Red-teaming, the systematic attempt to elicit harmful, unsafe, or policy-violating behavior from a model using carefully constructed adversarial inputs, is an evaluation method that public benchmarks do not replicate. 

A model can score well on every standard safety benchmark while being vulnerable to specific adversarial prompt patterns that a motivated user could discover. Red-teaming before deployment is the most reliable way to identify these vulnerabilities. It requires evaluators with the expertise and mandate to attempt to break the system, not just to assess its average-case behavior. Trust and safety evaluation that incorporates systematic red-teaming alongside standard safety metrics provides a safety assurance signal that automated safety benchmark scores cannot supply.

Regression Testing Across Model Versions

A model evaluation program is not a point-in-time exercise. Models are updated, fine-tuned, and modified throughout their deployment lifecycle, and each change that affects a safety-relevant or quality-relevant behavior needs to be evaluated against the previous version before deployment. A regression test suite that runs on each model update catches capability degradations before they reach users. Building and maintaining this suite is an ongoing investment that most programs underestimate at project inception.

Evaluating RAG Systems for Gen AI

Retrieval-augmented generation systems have a more complex failure surface than standalone language models. The retrieval component can fail to find relevant documents. The reranking component can return the wrong documents as the most relevant. The generation component can fail to use the retrieved documents correctly, ignoring relevant content or hallucinating content not present in the retrieved context. 

Evaluating a RAG system requires measuring each of these components separately, not just the end-to-end output quality. End-to-end metrics that look good can mask retrieval failures that are being compensated for by a capable generator, or generation quality failures that are being compensated for by excellent retrieval. DDD’s detailed guide on RAG data quality, evaluation, and governance covers the RAG-specific evaluation methodology in depth.

Context Faithfulness as a Core RAG Evaluation Metric

Context faithfulness, the property that generated responses are grounded in and consistent with the retrieved context rather than generated from the model’s parametric knowledge, is a critical evaluation dimension for RAG systems that standard output quality metrics do not assess. 

A RAG system that produces accurate responses by ignoring the retrieved context and falling back on parametric knowledge is not providing the factual grounding that the RAG architecture was intended to supply. Measuring context faithfulness requires an evaluation methodology that compares the generated output against the retrieved documents, not just against a reference answer.

Evaluating Agentic AI Systems

Why Task Completion Is Not Enough

Agentic AI systems take sequences of actions in dynamic environments, using tools, APIs, and external services to accomplish multi-step goals. Evaluating them requires a fundamentally different framework from evaluating single-turn text generation. Task completion rate, whether the agent successfully achieves the stated goal, is a necessary but insufficient evaluation metric. 

An agent that completes tasks using inefficient action sequences, makes unnecessary tool calls, or produces correct outcomes through reasoning paths that would fail on slightly different inputs is not a reliable production system, even if its task completion rate looks acceptable. Building trustworthy agentic AI with human oversight discusses the evaluation and governance frameworks that agentic systems require.

Reliability, Safety, and Trajectory Evaluation

Agentic evaluation needs to measure at least four dimensions beyond task completion: reasoning trajectory quality, which assesses whether the agent’s reasoning steps are sound even when the outcome is correct; tool use accuracy, which evaluates whether tools are invoked appropriately with correct parameters; robustness to unexpected inputs during multi-turn interactions; and safety under adversarial conditions, including attempts to manipulate the agent into taking unauthorized actions. Human-in-the-loop evaluation remains the reference standard for agentic safety assessment, particularly for systems that take actions with real-world consequences. Agentic AI deployments that skip systematic safety evaluation before production release create liability exposure that standard output quality metrics will not have revealed.

The Evaluation Stack: What a Complete Program Looks Like

Layering Benchmark, Automated, and Human Evaluation

A complete evaluation program for a production GenAI system combines multiple layers. Public benchmarks provide broad capability signals and facilitate external comparisons, with appropriate discounting for contamination risk and saturation. Automated metrics, including reference-based metrics for structured tasks and LLM-judge approaches for open-ended generation, provide scalable quality signals that can run on large evaluation sets.

Human evaluation provides the ground truth for dimensions that automated methods cannot reliably assess, including safety, domain accuracy, and output quality in the deployment context. Each layer informs a different aspect of the deployment decision.

The Evaluation Timeline

Evaluation should be integrated into the development lifecycle, not run as a pre-deployment checkpoint. Capability assessment runs during model or fine-tuning selection. Task-specific evaluation runs after initial fine-tuning to assess whether the fine-tuned model actually improved on the target task. Red-teaming and safety evaluation run before any production deployment. Regression testing runs on every model update that touches safety-relevant or quality-relevant components. Post-deployment monitoring provides an ongoing signal that the production distribution has not drifted in ways that have degraded model performance.

The Common Gap: Evaluation Data Quality

The most common single failure point in enterprise evaluation programs is not the choice of metrics or the evaluation methodology. It is the quality and representativeness of the evaluation data itself. 

An evaluation set that was assembled quickly from available examples, which over-represents easy cases and under-represents the edge cases and domain variations that matter for production reliability, will produce evaluation scores that overestimate the model’s readiness for deployment. Annotation solutions that bring the same quality discipline to evaluation data as to training data are a structural requirement for evaluation programs that actually predict production performance.

How Digital Divide Data Can Help

Digital Divide Data provides an end-to-end evaluation infrastructure for GenAI programs, from evaluation dataset design through human annotation and LLM-judge calibration to ongoing regression testing and post-deployment monitoring.

The model evaluation services cover task-specific evaluation dataset construction, with explicit coverage of edge cases, domain-specific inputs, and behavioral consistency test variants. Evaluation sets are built from production-representative inputs rather than repurposed public benchmarks, producing evaluation scores that predict deployment performance rather than benchmark-suite performance.

For safety and quality evaluation, human preference optimization services provide systematic human quality signal collection across the dimensions that automated metrics miss: factual accuracy, helpfulness, tone appropriateness, and safety. Red-teaming capability is integrated into safety evaluation workflows, covering adversarial prompt patterns relevant to the specific deployment context rather than generic safety benchmarks.

For agentic deployments, evaluation methodology extends to trajectory assessment, tool use accuracy, and multi-turn robustness, with human evaluation covering the safety-critical judgment calls that LLMs cannot reliably assess. Trust and safety solutions include structured red-teaming protocols and ongoing monitoring frameworks that keep the safety signal current as models and user behavior evolve.

Talk to an Expert and build an evaluation program that actually predicts production performance

Conclusion

Benchmark scores are starting points for model assessment, not finishing lines. The dimensions that determine whether a GenAI system actually performs in production, behavioral consistency, calibration, domain accuracy, safety under adversarial conditions, and output quality on open-ended tasks are systematically undercovered by public benchmarks and require a purpose-built evaluation methodology to measure reliably. 

Teams that invest in evaluation infrastructure commensurate with what they invest in model development will have an accurate picture of their system’s readiness before deployment. Teams that rely on benchmark numbers as their primary evidence for production readiness will consistently be surprised by what they encounter after launch.

As GenAI systems take on more consequential tasks, including customer-facing interactions, regulated industry applications, and agentic workflows with real-world effects, the cost of inadequate evaluation rises accordingly. 

The investment in evaluation data quality, human annotation capacity, and task-specific evaluation methodology is not overhead on the development program. It is the mechanism that transforms a model that performs in controlled conditions into a system that can be trusted in production. Generative AI evaluation built around production-representative data and systematic human quality signal is the foundation that makes that trust warranted.

References

Mohammadi, M., Li, Y., Lo, J., & Yip, W. (2025). Evaluation and benchmarking of LLM agents: A survey. Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2. ACM. https://doi.org/10.1145/3711896.3736570

Stanford HAI. (2024). Technical performance. 2024 AI Index Report. Stanford University Human-Centered AI. https://hai.stanford.edu/ai-index/2024-ai-index-report/technical-performance

Frequently Asked Questions

Q1. What is benchmark contamination, and why does it matter for model selection?

Benchmark contamination occurs when test questions from public datasets appear in a model’s pre-training corpus, causing scores to reflect memorization rather than genuine capability, which means leaderboard rankings may not accurately reflect how models will perform on unseen production inputs.

Q2. When is human evaluation necessary versus automated metrics?

Human evaluation is necessary for open-ended generation tasks where quality has subjective dimensions, for safety-critical judgment calls where automated judge bias could mask failure modes, and for domain-specific accuracy assessment that requires expert knowledge.

Q3. What evaluation dimensions do public benchmarks consistently miss?

Behavioral consistency across rephrased inputs, output calibration, robustness to adversarial inputs, domain accuracy in specific deployment contexts, and open-ended generation quality are the dimensions most systematically undercovered by standard public benchmarks.

Q4. How should RAG systems be evaluated differently from standalone language models?

RAG evaluation requires measuring retrieval component performance, reranking accuracy, and context faithfulness separately from end-to-end output quality, since good end-to-end results can mask component failures that will cause problems under different input distributions.

Model Evaluation for GenAI: Why Benchmarks Alone Are Not Enough Read Post »

Multimodal AI Training

Multimodal AI Training: What the Data Actually Demands

The difficulty of multimodal training data is not simply that there is more of it to produce. It is that the relationships between modalities must be correct, not just the data within each modality. An image that is accurately labeled for object detection but paired with a caption that misrepresents the scene produces a model that learns a contradictory representation of reality. 

A video correctly annotated for action recognition but whose audio is misaligned with the visual frames teaches the model the wrong temporal relationship between what happens and how it sounds. These cross-modal consistency problems do not show up in single-modality quality checks. They require a different category of annotation discipline and quality assurance, one that the industry is still in the process of developing the infrastructure to apply at scale.

This blog examines what multimodal AI training actually demands from a data perspective, covering how cross-modal alignment determines model behavior, what annotation quality requirements differ across image, video, and audio modalities, why multimodal hallucination is primarily a data problem rather than an architecture problem, how the data requirements shift as multimodal systems move into embodied and agentic applications, and what development teams need to get right before their training data.

What Multimodal AI Training Actually Involves

The Architecture and Where Data Shapes It

Multimodal large language models process inputs from multiple data types by routing each through a modality-specific encoder that converts raw data into a mathematical representation, then passing those representations through a fusion mechanism that aligns and combines them into a shared embedding space that the language model backbone can operate over. The vision encoder handles images and video frames. The audio encoder handles speech and sound. The text encoder handles written content. The fusion layer or connector module is where the modalities are brought together, and it is the component whose quality is most directly determined by the quality of the training data.

A fusion layer that has been trained on accurately paired, consistently annotated, well-aligned multimodal data learns to produce representations where the image of a dog and the word dog, and the sound of a bark occupy regions of the embedding space that are meaningfully related. A fusion layer trained on noisily paired, inconsistently annotated data learns a blurrier, less reliable mapping that produces the hallucination and cross-modal reasoning failures that characterize underperforming multimodal systems. The architecture sets the ceiling. The training data determines how close to that ceiling the deployed model performs.

The Scale Requirement That Changes the Data Economics

Multimodal systems require significantly more training data than their unimodal counterparts, not only in absolute volume but in the combinatorial variety needed to train the cross-modal relationships that define the system’s capabilities. A vision-language model that is trained primarily on image-caption pairs from a narrow visual domain will learn image-language relationships within that domain and generalize poorly to images with different characteristics, different object categories, or different spatial arrangements. 

The diversity requirement is multiplicative across modalities: a system that needs to handle diverse images, diverse language, and diverse audio needs training data whose diversity spans all three dimensions simultaneously, which is a considerably harder curation problem than assembling diverse data in any one modality.

Cross-Modal Alignment: The Central Data Quality Problem

What Alignment Means and Why It Fails

Cross-modal alignment is the property that makes a multimodal model genuinely multimodal rather than simply a collection of unimodal models whose outputs are concatenated. A model with good cross-modal alignment has learned that the visual representation of a specific object class, the textual description of that class, and the auditory signature associated with it are related, and it uses that learned relationship to improve its performance on tasks that involve any combination of the three. A model with poor cross-modal alignment has learned statistical correlations within each modality separately but has not learned the deeper relationships between them.

Alignment failures in training data take several forms. The most straightforward is incorrect pairing: an image paired with a caption that does not accurately describe it, a video clip paired with a transcript that corresponds to a different moment, or an audio recording labeled with a description of a different sound source. Less obvious but equally damaging is partial alignment: a caption that accurately describes some elements of the image but misses others, a transcript that is textually accurate but temporally misaligned with the audio, or an annotation that correctly labels the dominant object in a scene but ignores the contextual elements that determine the scene’s meaning.

The Temporal Alignment Problem in Video and Audio

Temporal alignment is a specific and particularly demanding form of cross-modal alignment that arises in video and audio data. A video is not a collection of independent frames. It is a sequence in which the relationship between what happens at time T and what happens at time T+1 carries meaning that neither frame conveys alone. An action recognition model trained on video data where frame-level annotations do not accurately reflect the temporal extent of the action, or where the action label is assigned to the wrong temporal segment, learns an imprecise representation of the action’s dynamics. Video annotation for multimodal training requires temporal precision that static image annotation does not, including accurate action boundary detection, consistent labeling of motion across frames, and synchronization between visual events and their corresponding audio or textual descriptions.

Audio-visual synchronization is a related challenge that receives less attention than it deserves in multimodal data quality discussions. Human speech is perceived as synchronous with lip movements within a tolerance of roughly 40 to 100 milliseconds. Outside that window, the perceptual mismatch is noticeable to human observers. For a multimodal model learning audio-visual correspondence, even smaller misalignments can introduce noise into the learned relationship between the audio signal and the visual event it accompanies. At scale, systematic small misalignments across a large training corpus can produce a model that has learned a subtly incorrect temporal model of the audio-visual world.

Image Annotation for Multimodal Training

Beyond Object Detection Labels

Image annotation for multimodal training differs from image annotation for standard computer vision in a dimension that is easy to underestimate: the relationship between the image content and the language that describes it is part of what is being learned, not a byproduct of the annotation. 

An object detection label that places a bounding box around a car is sufficient for training a car detector. The same bounding box is insufficient for training a vision-language model, because the model needs to learn not only that the object is a car but how the visual appearance of that car relates to the range of language that might describe it: vehicle, automobile, sedan, the red car in the foreground, the car partially occluded by the pedestrian. Image annotation services designed for multimodal training need to produce richer, more linguistically diverse descriptions than standard computer vision annotation, and the consistency of those descriptions across similar images is a quality dimension that directly affects cross-modal alignment.

The Caption Diversity Requirement

Caption diversity is a specific data quality requirement for vision-language model training that is frequently underappreciated. A model trained on image-caption pairs where all captions follow a similar template learns to associate visual features with a narrow range of linguistic expression. The model will perform well on evaluation tasks that use similar language but will generalize poorly to the diversity of phrasing, vocabulary, and descriptive style that real-world applications produce. Producing captions with sufficient linguistic diversity while maintaining semantic accuracy requires annotation workflows that explicitly vary phrasing, descriptive focus, and level of detail across multiple captions for the same image, rather than treating caption generation as a single-pass labeling task.

Spatial Relationship and Compositional Annotation

Spatial relationship annotation, which labels the geometric and semantic relationships between objects within an image rather than just the identities of the objects themselves, is a category of annotation that matters significantly more for multimodal model training than for standard object detection.

A vision-language model that needs to answer the question which cup is to the left of the keyboard requires training data that explicitly annotates spatial relationships, not just object identities. The compositional reasoning failures that characterize many current vision-language models, where the model correctly identifies all objects in a scene but fails on questions about their spatial or semantic relationships, are in part a reflection of training data that under-annotates these relationships.

Video Annotation: The Complexity That Scale Does Not Resolve

Why Video Annotation Is Not Image Annotation at Scale

Video is not a large collection of images. The temporal dimension introduces annotation requirements that have no equivalent in static image labeling. Action boundaries, the precise frame at which an action begins and ends, must be annotated consistently across thousands of video clips for the model to learn accurate representations of action timing. Event co-occurrence relationships, which events happen simultaneously and which happen sequentially, must be annotated explicitly rather than inferred. 

Long-range temporal dependencies, where an event at the beginning of a clip affects the interpretation of an event at the end, require annotators who watch and understand the full clip before making frame-level annotations. 

Dense Video Captioning and the Annotation Depth It Requires

Dense video captioning, the task of generating textual descriptions of all events in a video with accurate temporal localization, is one of the most data-demanding tasks in multimodal AI training. Training data for dense captioning requires that every significant event in a video clip be identified, temporally localized to its start and end frames, and described in natural language with sufficient specificity to distinguish it from similar events in other clips. The annotation effort per minute of video for dense captioning is dramatically higher than for single-label video classification, and the quality of the temporal localization directly determines the precision of the cross-modal correspondence the model learns.

Multi-Camera and Multi-View Video

As multimodal AI systems move into embodied and Physical AI applications, video annotation requirements extend to multi-camera setups where the same event must be annotated consistently across multiple viewpoints simultaneously. 

A manipulation action that is visible from the robot’s wrist camera, the overhead camera, and a side camera must be labeled with consistent action boundaries, consistent object identities, and consistent descriptions across all three views. Inconsistencies across views produce training data that teaches the model contradictory representations of the same physical event. The multisensor fusion annotation challenges that arise in Physical AI settings apply equally to multi-view video annotation, and the annotation infrastructure needed to handle them is considerably more complex than what single-camera video annotation requires.

Audio Annotation: The Modality Whose Data Quality Is Least Standardized

What Audio Annotation for Multimodal Training Requires

Audio annotation for multimodal training is less standardized than image or text annotation, and the quality standards that exist in the field are less widely adopted. A multimodal system that processes speech needs training data where speech is accurately transcribed, speaker-attributed in multi-speaker contexts, and annotated for the non-linguistic features, tone, emotion, pace, and prosody that carry meaning beyond the words themselves. 

A system that processes environmental audio needs training data where sound events are accurately identified, temporally localized, and described in a way that captures the semantic relationship between the sound and its source. Audio annotation at the quality level that multimodal model training requires is more demanding than transcription alone, and teams that treat audio annotation as a transcription task will produce training data that gives their models a linguistically accurate but perceptually shallow representation of audio content.

The Language Coverage Problem in Audio Training Data

Audio training data for speech-capable multimodal systems faces an acute version of the language coverage problem that affects text-only language model training. Systems trained predominantly on English speech data perform significantly worse on other languages, and the performance gap is larger for audio than for text because the acoustic characteristics of speech vary across languages in ways that require explicit representation in the training data rather than cross-lingual transfer. 

Building multimodal systems that perform equitably across languages requires intentional investment in audio data collection and annotation across linguistic communities, an investment that most programs underweight relative to its impact on deployed model performance. Low-resource languages in AI are directly relevant to audio-grounded multimodal training, where low-resource language communities face the sharpest capability gaps.

Emotion and Paralinguistic Annotation

Paralinguistic annotation, the labeling of speech features that convey meaning beyond the literal content of the words, is a category of audio annotation that is increasingly important for multimodal systems designed for human interaction applications. Tone, emotional valence, speech rate variation, and prosodic emphasis all carry semantic information that a model interacting with humans needs to process correctly. Annotating these features requires annotators who can make consistent judgments about inherently subjective qualities, which in turn requires annotation guidelines that are specific enough to produce inter-annotator agreement and quality assurance processes that measure that agreement systematically.

Multimodal Hallucination: A Data Problem More Than an Architecture Problem

How Hallucination in Multimodal Models Differs From Text-Only Hallucination

Hallucination in language models is a well-documented failure mode where the model generates content that is plausible in form but factually incorrect. In multimodal models, hallucination takes an additional dimension: the model generates content that is inconsistent with the visual or audio input it has been given, not just with external reality. A model that correctly processes an image of an empty table but generates a description that includes objects not present in the image is exhibiting cross-modal hallucination, a failure mode distinct from factual hallucination and caused by a different mechanism.

Cross-modal hallucination is primarily a training data problem. It arises when the training data contains image-caption pairs where the caption describes content not visible in the image, when the model has been exposed to so much text describing common image configurations that it generates those descriptions regardless of what the image actually shows, or when the cross-modal alignment in the training data is weak enough that the model’s language prior dominates its visual processing. The tendency for multimodal models to generate plausible-sounding descriptions that prioritize language fluency over visual fidelity is a direct consequence of training data where language quality was prioritized over cross-modal accuracy.

How Training Data Design Can Reduce Hallucination

Reducing cross-modal hallucination through training data design requires explicit attention to the accuracy of the correspondence between modalities, not just the quality of each modality independently. Negative examples that show the model what it looks like when language is inconsistent with visual content, preference data that systematically favors visually grounded descriptions over hallucinated ones, and fine-grained correction annotations that identify specific hallucinated elements and provide corrected descriptions are all categories of training data that target the cross-modal alignment failure underlying hallucination. Human preference optimization approaches applied specifically to cross-modal faithfulness, where human annotators compare model outputs for their visual grounding rather than general quality, are among the most effective interventions currently in use for reducing multimodal hallucination in production systems.

Evaluation Data for Hallucination Assessment

Measuring hallucination in multimodal models requires evaluation data that is specifically designed to surface cross-modal inconsistencies, not just general performance benchmarks. Evaluation sets that include images with unusual configurations, rare object combinations, and scenes that contradict common statistical associations are more diagnostic of hallucination than standard benchmark images that conform to typical visual patterns the model has likely seen during training. Building evaluation data specifically for hallucination assessment is a distinct annotation task from building training data; model evaluation services are addressed through targeted adversarial data curation designed to reveal the specific cross-modal failure modes most relevant to each system’s deployment context.

Multimodal Data for Embodied and Agentic AI

When Modalities Include Action

The multimodal AI training challenge takes on additional complexity when the system is not only processing visual, audio, and language inputs but also taking actions in the physical world. Vision-language-action models, which underpin much of the current development in robotics and Physical AI, must learn not only to understand what they see and hear but to connect that understanding to appropriate physical actions. 

The training data for these systems is not image-caption pairs. It is sensorimotor sequences: synchronized streams of visual input, proprioceptive sensor readings, force feedback, and the action commands that a human operator or an expert policy selects in response to those inputs. VLA model analysis services and the broader context of vision-language-action models and autonomy address the annotation demands specific to this category of multimodal training data.

Instruction Tuning Data for Multimodal Agents

Instruction tuning for multimodal agents, which teaches a system to follow complex multi-step instructions that involve perception, reasoning, and action, requires training data that is structured differently from standard multimodal pairs. Each training example is a sequence: an instruction, a series of observations, a series of intermediate reasoning steps, and a series of actions, all of which need to be consistently annotated and correctly attributed. The annotation effort for multimodal instruction tuning data is substantially higher per example than for standard image-caption pairs, and the quality standards are more demanding because errors in the action sequence or the reasoning annotation propagate directly into the model’s learned behavior. Building generative AI datasets with human-in-the-loop workflows is particularly valuable for this category of training data, where the judgment required to evaluate whether a multi-step action sequence is correctly annotated exceeds what automated quality checks can reliably assess.

Quality Assurance Across Modalities

Why Single-Modality QA Is Not Enough

Quality assurance for multimodal training data requires checking not only within each modality but across modalities simultaneously. A QA process that verifies image annotation quality independently and caption quality independently will pass image-caption pairs where both elements are individually correct, but the pairing is inaccurate. A QA process that checks audio transcription quality independently and video annotation quality independently will pass audio-video pairs where the transcript is accurate but temporally misaligned with the video. Cross-modal QA, which treats the relationship between modalities as the primary quality dimension, is a distinct capability from single-modality QA and requires annotation infrastructure and annotator training that most programs have not yet fully developed.

Inter-Annotator Agreement in Multimodal Annotation

Inter-annotator agreement, the standard quality metric for annotation consistency, is more complex to measure in multimodal settings than in single-modality settings. Agreement on object identity within an image is straightforward to quantify. Agreement on whether a caption accurately represents the full semantic content of an image requires subjective judgment that different annotators may apply differently. 

Agreement on the correct temporal boundary of an action in a video requires a level of precision that different annotators may interpret differently, even when given identical guidelines. Building annotation guidelines that are specific enough to produce measurable inter-annotator agreement on cross-modal quality dimensions, and measuring that agreement systematically, is a precondition for the kind of training data quality that production of multimodal systems requires.

Trust and Safety Annotation in Multimodal Data

Multimodal training data introduces trust and safety annotation requirements that are qualitatively different from text-only content moderation. Images and videos can carry harmful content in ways that text descriptions do not capture. Audio can include harmful speech that automated transcription produces as apparently neutral text. The combination of modalities can produce harmful associations that would not arise from either modality alone. Trust and safety solutions for multimodal systems need to operate across all modalities simultaneously and need to be designed with the specific cross-modal harmful content patterns in mind, not simply extended from text-only content moderation frameworks.

How Digital Divide Data Can Help

Digital Divide Data provides end-to-end multimodal data solutions for AI development programs across the full modality stack. The approach is built around the recognition that multimodal model quality is determined by cross-modal data quality, not by the quality of each modality independently, and that the annotation infrastructure to assess and ensure cross-modal quality requires specific investment rather than extension of single-modality workflows.

On the image side, our image annotation services produce the linguistically diverse, relationship-rich, spatially accurate descriptions that vision-language model training requires, with explicit coverage of compositional and spatial relationships rather than object identity alone. Caption diversity and cross-modal consistency are treated as primary quality dimensions in annotation guidelines and QA protocols.

On the video side, our video annotation capabilities address the temporal annotation requirements of multimodal training data with clip-level understanding as a prerequisite for frame-level labeling, consistent action boundary detection, and synchronization between visual, audio, and textual annotation streams. For embodied AI programs, DDD’s annotation teams handle multi-camera, multi-view annotation with cross-view consistency required for action model training.

On the audio side, our annotation services extend beyond transcription to include paralinguistic feature annotation, speaker attribution, sound event localization, and multilingual coverage, with explicit attention to low-resource linguistic communities. For multimodal programs targeting equitable performance across languages, DDD provides the audio data coverage that standard English-dominant datasets cannot supply.

For programs addressing multimodal hallucination, our human preference optimization services include cross-modal faithfulness evaluation, producing preference data that specifically targets the visual grounding failures underlying hallucination. Model evaluation services provide adversarial multimodal evaluation sets designed to surface hallucination and cross-modal reasoning failures before they appear in production.

Build multimodal AI systems grounded in data that actually integrates modalities. Talk to an expert!

Conclusion

Multimodal AI training is not primarily a harder version of unimodal training. It is a different kind of problem, one where the quality of the relationships between modalities determines model behavior more than the quality of each modality independently. The teams that produce the most capable multimodal systems are not those with the largest training corpora or the most sophisticated architectures. 

They are those that invest in annotation infrastructure that can produce and verify cross-modal accuracy at scale, in evaluation frameworks that measure cross-modal reasoning and hallucination rather than unimodal benchmarks, and in data diversity strategies that explicitly span the variation space across all modalities simultaneously. Each of these investments requires a level of annotation sophistication that is higher than what single-modality programs have needed, and teams that attempt to scale unimodal annotation infrastructure to multimodal requirements will consistently find that the cross-modal quality gaps they did not build for are the gaps that limit their model’s real-world performance.

The trajectory of AI development is toward systems that process the world the way humans do, through the simultaneous integration of what they see, hear, read, and do. That trajectory makes multimodal training data quality an increasingly central competitive factor rather than a technical detail. Programs that build the annotation infrastructure, quality assurance processes, and cross-modal consistency standards now will be better positioned to develop the next generation of multimodal capabilities than those that treat data quality as a problem to be addressed after model performance plateaus. 

Digital Divide Data is built to provide the multimodal data infrastructure that makes that early investment possible across every modality that production AI systems require.

References

Lan, Z., Chakraborty, R., Munikoti, S., & Agarwal, S. (2025). Multimodal AI: Integrating diverse data modalities for advanced intelligence. Emergent Mind. https://www.emergentmind.com/topics/multimodal-ai

Gui, L. (2025). Toward data-efficient multimodal learning. Carnegie Mellon University Language Technologies Institute Dissertation. https://lti.cmu.edu/research/dissertations/gui-liangke-dissertation-document.pdf

Chen, L., Lin, F., Shen, Y., Cai, Z., Chen, B., Zhao, Z., Liang, T., & Zhu, W. (2025). Efficient multimodal large language models: A survey. Visual Intelligence, 3(10). https://doi.org/10.1007/s44267-025-00099-6

Frequently Asked Questions

What makes multimodal training data harder to produce than single-modality data?

Cross-modal alignment accuracy, where the relationship between modalities must be correct rather than just the content within each modality, adds a quality dimension that single-modality annotation workflows are not designed to verify and that requires distinct QA infrastructure to assess systematically.

What is cross-modal hallucination, and how is it different from standard LLM hallucination?

Cross-modal hallucination occurs when a multimodal model generates content inconsistent with its visual or audio input, rather than just inconsistent with factual reality, arising from weak cross-modal alignment in training data rather than from language model statistical biases alone.

How much more training data does a multimodal system need compared to a text-only model?

The volume requirement is substantially higher because diversity must span multiple modality dimensions simultaneously, and quality requirements are more demanding since cross-modal accuracy must be verified in addition to within-modality quality.

Why is temporal alignment in video annotation so important for multimodal model training?

Temporal misalignment in video annotation teaches the model incorrect associations between what happens visually and what is described linguistically or heard aurally, producing models with systematically wrong temporal representations of events and actions.

Multimodal AI Training: What the Data Actually Demands Read Post »

LLM Fine-Tuning

Why Most Enterprise LLM Fine-Tuning Projects Underdeliver

The premise of enterprise LLM fine-tuning is straightforward enough to be compelling. Take a capable general-purpose language model, train it further on proprietary data from your domain, and get a model that performs markedly better on the tasks that matter to your organization. 

The gap between that premise and what most enterprise fine-tuning projects actually deliver is wide enough to have become one of the more reliably frustrating patterns in enterprise AI adoption. Teams spend months on data preparation and training runs, consume substantial GPU budgets, and arrive at a model that performs comparably to the base model they started with, or worse, performs well on the benchmark they optimized for and poorly on the actual production workload.

The gap is not primarily a technical failure. The algorithms work. Parameter-efficient fine-tuning techniques have matured significantly and are accessible to any team with reasonable engineering resources. The failures are upstream and downstream of the training run itself: in the quality and relevance of the training data, in the mismatch between the fine-tuning objective and the actual production task, in the absence of evaluation frameworks that measure what actually matters, and in the organizational assumptions about what fine-tuning is and is not appropriate for. Addressing these failures requires a clearer understanding of what enterprise LLM fine-tuning can and cannot be expected to deliver, and what the preconditions for a project that actually closes the performance gap look like.

This blog examines why most enterprise LLM fine-tuning projects underdeliver, covering the structural reasons that data quality problems dominate fine-tuning outcomes, and how catastrophic forgetting undermines performance.

What Enterprise Fine-Tuning Is Actually Trying to Solve

The Gap That Fine-Tuning Is Supposed to Close

A general-purpose language model trained on broad internet-scale data has learned a great deal about language, reasoning, and general world knowledge. What it has not learned is your organization’s specific terminology, your domain’s particular conventions, your internal document formats, your compliance constraints, or the nuanced judgment calls your subject matter experts make. Fine-tuning promises that additional training on domain-specific examples can close that gap, producing a model that speaks your domain’s language, follows your conventions, and applies the judgment patterns you need.

That promise is real, but it is more conditional than it usually appears in the initial project framing. Fine-tuning is effective at teaching a model to change its style, follow specific output formats, apply domain vocabulary consistently, and replicate the structure of domain-specific responses. It is considerably less effective at teaching a model new factual knowledge, correcting systematic reasoning errors in the base model, or producing reliable behavior on tasks that differ in meaningful ways from the fine-tuning examples. The mismatch between what teams expect fine-tuning to accomplish and what it reliably delivers is the first place where projects begin to underdeliver.

When Fine-Tuning Is the Right Tool

Fine-tuning is most effective when the production task has a consistent structure that can be demonstrated through examples, when the required behavior is primarily a matter of style, format, or domain register rather than novel knowledge, and when a sufficient volume of high-quality task-representative examples can be assembled. 

Legal document summarization with consistent output structure, customer service response generation in a specific organizational tone, and clinical note formatting for a defined documentation standard: these are use cases where fine-tuning is likely to deliver measurable improvement over prompting alone. Tasks that require the model to retrieve specific factual information, reason across long documents, or apply judgment that varies substantially across cases are often better addressed through retrieval-augmented generation or prompt engineering, and deploying fine-tuning for them is a common source of underperformance.

The Data Quality Problem That Derails Most Projects

Why Training Data Quality Is the Primary Determinant of Fine-Tuning Outcomes

The most consistent finding across enterprise fine-tuning programs that underdeliver is that the training data was not as good as the team believed it to be. This is not a subtle problem. It is the dominant failure mode, appearing in various forms across virtually every project that does not achieve its intended performance improvement. 

The relationship between training data quality and fine-tuning outcome is more direct than in pre-training, because the fine-tuning dataset is small enough that individual quality problems have disproportionate influence on the model’s learned behavior. A systematic error in a pre-training corpus of a hundred billion tokens will have a negligible effect on the model’s overall behavior. The same systematic error in a fine-tuning dataset of ten thousand examples will produce a model that reliably replicates the error. 

The Three Most Common Data Quality Failures

The first is inconsistency across examples. Enterprise data assembled from operational systems, human-written documents, or labeled outputs from multiple annotators will typically contain inconsistent patterns: different levels of formality, different approaches to similar cases, and different levels of detail. A model trained on this inconsistency does not learn a clear behavior pattern. It learns an average of conflicting patterns, which produces outputs that are neither definitively one approach nor definitively another, and that satisfy no one’s actual requirements.

The second is contamination by low-quality examples that are included because they are available rather than because they are good. In enterprise data collection, the temptation to include more examples to reach a volume target is strong, and the quality bar for inclusion is often lower than it should be. Examples that are technically correct but poorly constructed, that use domain vocabulary inconsistently, or that apply the target behavior only partially will actively degrade model performance relative to a smaller, cleaner dataset. The quality-over-quantity principle in fine-tuning data assembly is not a platitude. It reflects how the fine-tuning gradient update works: every example in the dataset shifts the model’s parameters, and bad examples shift them in the wrong direction. Text annotation services that apply consistent quality standards across the full dataset, rather than accepting examples that merely pass a minimum threshold, are a structural requirement for fine-tuning data that actually improves model performance.

The third is a distribution mismatch between the fine-tuning data and the actual production inputs. Teams often assemble fine-tuning data from the examples that are easiest to collect, which are the well-structured, easy cases. The production workload includes edge cases, ambiguous inputs, unusual phrasing patterns, and domain variants that the easy-case dataset does not cover. A model fine-tuned on the easy cases will perform well on easy cases and no better than the base model on everything else. If the easy cases constitute a minority of the production workload, the fine-tuning project will yield disappointing real-world results even when benchmark metrics appear acceptable.

Catastrophic Forgetting: The Problem Teams Discover Too Late

What Catastrophic Forgetting Actually Means in Practice

Catastrophic forgetting is the phenomenon where a language model, when fine-tuned on a specific task, loses some of the general capabilities it possessed before fine-tuning. The mechanism is straightforward: the parameter updates that teach the model the new task overwrite some of the parameter configurations that supported pre-existing capabilities. The result is a model that is better at the fine-tuning task and worse at other tasks it previously handled well.

For enterprise programs, catastrophic forgetting shows up in ways that are not always immediately obvious. A model fine-tuned on legal document analysis may become noticeably worse at general reasoning tasks that legal work occasionally requires. A model fine-tuned on customer service responses may lose some of its ability to handle the off-script queries that make up a meaningful fraction of real customer interactions. A model fine-tuned on a narrow set of document formats may fail to handle format variations that it would have managed competently before fine-tuning. These regressions are often discovered after deployment, when users encounter cases that the evaluation framework did not cover.

Why Parameter-Efficient Fine-Tuning Does Not Fully Solve the Problem

Parameter-efficient fine-tuning approaches, which modify only a small fraction of the model’s parameters while keeping the rest frozen, are often presented as a solution to catastrophic forgetting. The intuition is that smaller parameter changes mean less disruption to pre-existing capabilities. This intuition is partially correct but overstated. Research across multiple model families has demonstrated that even low-rank adaptation methods, which are among the most parameter-efficient approaches available, can produce significant forgetting on tasks that differ from the fine-tuning distribution, particularly when fine-tuning datasets are small and the fine-tuning task is narrow.

There is also a specific forgetting risk that receives less attention in enterprise contexts: the erosion of safety behaviors. Models that have been trained with safety guardrails through preference optimization can lose those guardrails when fine-tuned on datasets that do not reinforce them. An enterprise fine-tuning project that improves task performance while inadvertently degrading safety behavior has created a production risk that may not surface in standard evaluation until it produces a visible failure.

Managing Forgetting Through Dataset Design

The most practical mitigation for catastrophic forgetting in enterprise fine-tuning is dataset design rather than algorithm selection. Including a representative sample of general task examples alongside domain-specific examples in the fine-tuning dataset, sometimes called experience replay or rehearsal, helps preserve the parameter configurations that support general capabilities.

Including examples that exercise the model’s safety behaviors alongside domain task examples helps preserve those behaviors. The tradeoff is that a more diverse fine-tuning dataset requires more careful curation and a larger annotation investment. Human-in-the-loop approaches to building generative AI datasets that include deliberate coverage of both domain-specific and general behavioral requirements produce fine-tuning datasets that are less likely to create the forgetting regressions that teams discover in production.

The Evaluation Problem: Measuring the Wrong Thing

Why Benchmark Performance Does Not Predict Production Performance

The evaluation framework used for a fine-tuning project determines what the project appears to achieve. Teams that evaluate their fine-tuned model against a benchmark constructed from the same distribution as the training data will consistently find that their model performs well. Teams that evaluate against production inputs, including the edge cases, the unusual phrasings, the ambiguous requests, and the off-task queries that real users generate, will find a different picture. The gap between these two pictures is the gap between benchmark performance and production performance, and it is one of the most reliable explanations for why fine-tuning projects that look successful in development underperform in deployment.

The construction of the evaluation set is the most consequential methodological decision in a fine-tuning program. An evaluation set drawn from the same source as the training data, or constructed by the same team with the same selection criteria, will not reveal the distribution gaps and edge case failures that determine real-world performance. An evaluation set that is constructed independently, drawn from actual production inputs, and includes deliberate coverage of the cases the team is most uncertain about is significantly more predictive of deployment performance. Model evaluation services that maintain methodological independence between the fine-tuning program and the evaluation framework are a structural requirement for getting an honest picture of what the fine-tuned model actually delivers.

The Missing Behavioral Dimensions in Standard Evaluation

Standard fine-tuning evaluations typically measure task accuracy on held-out examples from the training distribution. What they rarely measure is behavioral consistency across rephrased inputs, robustness to adversarial or unusual inputs, calibration of confidence alongside accuracy, behavior under out-of-distribution conditions, and adherence to the safety and compliance behaviors the model is expected to maintain. Each of these dimensions can reveal failures that task accuracy does not capture.

Behavioral consistency is particularly important for enterprise deployments. A customer service model that gives different answers to semantically equivalent questions phrased differently is producing a user experience problem that accuracy metrics on a fixed test set will not reveal. A compliance-sensitive application that behaves correctly on standard inputs but incorrectly on slight rephrasings has a reliability problem that only behavioral consistency testing will surface. 

Building these dimensions into the evaluation framework from the start of the project, rather than adding them after a deployment failure draws attention to them, is one of the clearest differences between fine-tuning programs that deliver on their promises and those that do not.

Human Evaluation and Where It Cannot Be Replaced

Automated metrics capture some dimensions of output quality and miss others. For tasks where quality is partially subjective, where the correct answer depends on context that is difficult to encode in a metric, or where the model’s behavior needs to meet standards that are easier to recognize than to specify, human evaluation is not supplementary to automated metrics. It is the primary signal. Human preference optimization approaches that systematically collect and incorporate human quality judgments produce evaluation signals that automated metrics cannot replicate, and they are particularly important for catching the behavioral failures that look fine on paper but produce poor experiences when encountered by actual users.

Confusing Fine-Tuning With the Right Solution

When RAG Should Have Been the Answer

One of the most common patterns in enterprise fine-tuning projects that underdeliver is that fine-tuning was the answer to a question that was better answered by retrieval-augmented generation. Fine-tuning teaches a model behavioral patterns and stylistic preferences. It does not give a model reliable access to specific current facts, internal documents, or proprietary information that changes frequently. 

An enterprise that wants its language model to answer accurately about current product specifications, internal policy documents, or recent organizational decisions is unlikely to achieve that through fine-tuning, because fine-tuning encodes statistical patterns from training examples rather than providing a queryable knowledge store. RAG systems that retrieve relevant document chunks at inference time and condition the model’s response on retrieved context are a more appropriate architecture for this category of task, and deploying fine-tuning for it will produce a model that occasionally generates plausible-sounding but incorrect information derived from stale training patterns.

When Prompt Engineering Should Have Come First

Fine-tuning is also regularly deployed as a solution to problems that careful prompt engineering would have resolved at a fraction of the cost. A model that produces outputs in the wrong format when prompted naively may produce the correct format when given a well-structured system prompt with clear instructions and representative examples. A model that uses incorrect terminology when instructed generically may use the correct terminology when provided with a domain glossary in context. 

Prompt engineering services that systematically test the performance improvement achievable through prompt design before committing to a fine-tuning program are a practical and cost-effective step that many projects skip in their eagerness to begin training. The performance ceiling for well-engineered prompts on a capable base model is often higher than teams expect, and establishing that ceiling provides a realistic baseline for evaluating whether fine-tuning delivers meaningful incremental improvement.

The Organizational Assumption That Fine-Tuning Is a One-Time Event

A final underappreciated source of underdelivery is the organizational treatment of fine-tuning as a one-time project rather than a continuous lifecycle. A fine-tuned model that is deployed and left unchanged will experience performance degradation as the production data distribution shifts, as user needs evolve, as new domain terminology emerges, and as the base model it was derived from is updated. 

The initial fine-tuning project is the beginning of a model maintenance commitment, not the end of a capability acquisition effort. Programs that plan and budget for ongoing evaluation, data collection, and re-tuning cycles consistently outperform programs that treat the initial deployment as the finish line.

The Data Flywheel: Why Production Deployment Should Feed Back Into Training

Using Deployment Data to Improve Fine-Tuning Quality

The most valuable source of fine-tuning data for an enterprise model is not a manually curated dataset assembled before training. It is the production data generated by deploying the model and observing how it behaves on real inputs. Production data contains the actual distribution of inputs the model encounters, including the edge cases and unusual patterns that pre-deployment data collection typically underrepresents. It also contains the model’s failures, which are more informative for fine-tuning improvement than its successes.

Building a feedback loop between production deployment and the fine-tuning data pipeline, where failures are flagged, reviewed, corrected by subject matter experts, and incorporated into subsequent training rounds, is the mechanism that transforms a one-time fine-tuning project into a model that continuously improves against the actual production task. This feedback loop requires monitoring infrastructure to detect failures, review workflows to process flagged outputs, and annotation capacity to produce corrected examples at the rate the production system generates failures. Teams that build this infrastructure as part of the initial program design are significantly better positioned than those that attempt to add it retrospectively.

Active Learning and Prioritizing Annotation Effort

Not all production inputs are equally informative for fine-tuning improvement. Inputs on which the model produces confident, correct outputs contribute little to the next training round. Inputs on which the model is uncertain, incorrect, or inconsistent are the most valuable targets for human review and correction. Active learning approaches that prioritize annotation effort toward the most informative examples, rather than randomly sampling from the production stream, produce higher-quality fine-tuning datasets per annotation hour and deliver faster performance improvement per training cycle.

What a Fine-Tuning Project That Delivers Actually Looks Like

The Preconditions That Predict Success

Fine-tuning projects that deliver on their performance goals share a set of preconditions that projects that underdeliver typically lack. The use case has a clear, consistent structure that can be demonstrated through examples. The performance gap between the base model and the target is primarily a matter of style, domain register, or output format rather than factual knowledge. The evaluation framework measures production-relevant behavior rather than benchmark performance on training-distribution examples. The training dataset is small, clean, and highly representative of the production task rather than large, inconsistent, and assembled from whatever data was available. And the team has established clear baselines through prompt engineering before committing resources to fine-tuning.

The Program Architecture That Supports Sustained Performance

Beyond the initial project, the organizational architecture that supports sustained fine-tuning performance includes monitoring infrastructure to detect production failures and distribution shift, annotation capacity to process flagged outputs and produce corrected training examples, a regular re-tuning cycle that keeps the model current with production data distribution, and an evaluation framework that runs on each model version to catch regressions before deployment. Agentic AI systems that incorporate LLMs into complex workflows place additional demands on this architecture because failures in fine-tuned components can compound across the workflow in ways that are harder to diagnose than failures in standalone model deployments.

How Digital Divide Data Can Help

Digital Divide Data provides the data quality, annotation, and evaluation infrastructure that enterprise LLM fine-tuning programs need to deliver on their performance goals rather than falling into the familiar patterns of underperformance. The approach is built around the recognition that fine-tuning outcomes are primarily determined upstream and downstream of the training run itself, and that the training algorithm is rarely the limiting factor.

On the data side, DDD’s data collection and curation services are designed to produce fine-tuning datasets that are genuinely representative of the production task, consistent in quality across all examples, and diverse enough to cover the distribution the model will encounter in deployment. Dataset design explicitly addresses the coverage of edge cases, behavioral consistency requirements, and safety-relevant examples that standard data assembly processes tend to underweight.

On the evaluation side, our model evaluation services provide the methodological independence between the fine-tuning program and the evaluation framework that is necessary for an honest assessment of production performance. Evaluation frameworks are designed to cover production-relevant behavior, including edge cases, behavioral consistency, safety adherence, and out-of-distribution robustness, rather than focusing exclusively on benchmark accuracy.

For programs working with human preference optimization to align fine-tuned models with quality and safety requirements, RLHF and DPO data services provide the human quality signal that automated metrics cannot supply. For teams designing the fine-tuning data pipeline to incorporate production feedback, DDD’s active learning-informed annotation workflows ensure that human review effort is directed toward the examples that most improve model performance rather than spread uniformly across a production stream.

Build fine-tuning programs that actually close the performance gap. Talk to an Expert!

Conclusion

The underdelivery pattern in enterprise LLM fine-tuning is not a mystery. It follows predictably from a set of recurring errors: training data that is inconsistent, unrepresentative, or assembled from whatever was available rather than what was needed; evaluation frameworks that measure benchmark performance rather than production-relevant behavior; catastrophic forgetting that erodes general capabilities and safety behaviors in ways that standard evaluation does not detect; and organizational assumptions about fine-tuning that treat it as a one-time project rather than a continuous lifecycle. Each of these errors has a solution that is known, practical, and implementable without heroic engineering effort. The programs that deliver on their fine-tuning goals are not those that have access to better algorithms. They are those who treat data quality, evaluation rigor, and lifecycle planning with the same seriousness that they bring to model selection and training infrastructure.

For enterprise leaders evaluating their AI investment, the practical implication is that the return on a fine-tuning program is more sensitive to the quality of the data and evaluation infrastructure than to the choice of base model or fine-tuning technique. Investing in those foundations, through structured data curation, production-representative evaluation, and ongoing annotation capacity, is the most reliable lever for closing the gap between the performance that fine-tuning promises and the performance that production deployments actually need. 

Digital Divide Data is built to provide exactly that infrastructure, ensuring that the fine-tuning investment produces models that perform in deployment, not just in development.

References 

Raj J, M., Warrier, H., Desai, A., & Menon, S. (2024). Fine-tuning LLM for enterprise: Practical guidelines and recommendations. arXiv. https://arxiv.org/abs/2404.10779

Li, H., Ding, L., Fang, M., & Tao, D. (2024). Revisiting catastrophic forgetting in large language model tuning. Findings of EMNLP 2024. Association for Computational Linguistics. https://aclanthology.org/2024.findings-emnlp.249

Biderman, S., Portes, J., Ortiz, J. J., Paul, M., Greengard, A., Jennings, C., King, D., Havens, S., Chiley, V., Frankle, J., Blakeney, C., & Cunningham, J. P. (2024). LoRA learns less and forgets less. Transactions on Machine Learning Research. https://arxiv.org/abs/2405.09673

VentureBeat. (2025, February). MIT’s new fine-tuning method lets LLMs learn new skills without losing old ones. VentureBeat. https://venturebeat.com/orchestration/mits-new-fine-tuning-method-lets-llms-learn-new-skills-without-losing-old

Frequently Asked Questions

How much training data does an enterprise LLM fine-tuning project typically need?

A few hundred to a few thousand high-quality, task-representative examples are often sufficient for meaningful fine-tuning improvement; volume matters less than quality and representativeness of the production distribution.

What is catastrophic forgetting, and how does it affect enterprise models?

Catastrophic forgetting occurs when fine-tuning on a specific task overwrites parameter configurations supporting other capabilities, causing the model to perform worse on tasks it handled well before fine-tuning, including general reasoning and safety behaviors.

When should an enterprise choose RAG over fine-tuning?

RAG is more appropriate when the task requires access to specific, current, or frequently updated factual information, since fine-tuning encodes behavioral patterns rather than providing reliable access to specific knowledge.

How do you build an evaluation framework that reflects production performance?

Draw the evaluation set from actual production inputs rather than the same source as training data, include deliberate coverage of edge cases and behavioral consistency, and maintain methodological independence between the team building the fine-tuning dataset and the team constructing the evaluation set.

Why Most Enterprise LLM Fine-Tuning Projects Underdeliver Read Post »

Retrieval-augmented,Generation,(rag)

RAG Detailed Guide: Data Quality, Evaluation, and Governance

Retrieval Augmented Generation (RAG) is often presented as a simple architectural upgrade: connect a language model to a knowledge base, retrieve relevant documents, and generate grounded answers. In practice, however, most RAG systems fail not because the idea is flawed, but because they are treated as lightweight retrieval pipelines rather than full-fledged information systems.

When answers go wrong, teams frequently adjust prompts, swap models, or tweak temperature settings. Yet in enterprise environments, the real issue usually lies upstream. Incomplete repositories, outdated policies, inconsistent formatting, duplicated files, noisy OCR outputs, and poorly defined access controls quietly shape what the model is allowed to “know.” The model can only reason over the context it receives. If that context is fragmented, stale, or irrelevant, even the most advanced LLM will produce unreliable results.

In this article, let’s explore how Retrieval Augmented Generation or RAG should be treated not as a retrieval pipeline, but as a data system, an evaluation system, and a governance system.

Data Quality: The Foundation Of RAG Performance

There is a common instinct to blame the model when RAG answers go wrong. Maybe the prompt was weak. Maybe the model was too small. Maybe the temperature was set incorrectly. In many enterprise cases, however, the failure is upstream. The language model is responding to what it sees. If what it sees is incomplete, outdated, fragmented, or irrelevant, the answer will reflect that.

RAG systems fail more often due to poor data engineering than poor language models. When teams inherit decades of documents, they also inherit formatting inconsistencies, duplicates, version sprawl, and embedded noise. Simply embedding everything and indexing it does not transform it into knowledge. It transforms it into searchable clutter. Before discussing chunking or embeddings, it helps to define what data quality means in the RAG context.

Data Quality Dimensions in RAG

Data quality in RAG is not abstract. It can be measured and managed.

Completeness
Are all relevant documents present? If your knowledge base excludes certain product manuals or internal policies, retrieval will never surface them. Completeness also includes coverage of edge cases. For example, do you have archived FAQs for discontinued products that customers still ask about?

Freshness
Are outdated documents removed or clearly versioned? A single outdated HR policy in the index can generate incorrect advice. Freshness becomes more complex when departments update documents independently. Without active lifecycle management, stale content lingers.

Consistency
Are formats standardized? Mixed encodings, inconsistent headings, and different naming conventions may not matter to humans browsing folders. They matter to embedding models and search filters.

Relevance Density
Does each chunk contain coherent semantic information? A chunk that combines a privacy disclaimer, a table of contents, and a partial paragraph on pricing is technically valid. It is not useful.

Noise Ratio
How much irrelevant content exists in the index? Repeated headers, boilerplate footers, duplicated disclaimers, and template text inflate the search space and dilute retrieval quality.

If you think of RAG as a question answering system, these dimensions determine what the model is allowed to know. Weak data quality constrains even the best models.

Document Ingestion: Cleaning Before Indexing

Many RAG projects begin by pointing a crawler at a document repository and calling it ingestion. The documents are embedded. A vector database is populated. A demo is built. Weeks later, subtle issues appear.

Handling Real World Enterprise Data

Enterprise data is rarely clean. PDFs contain tables that do not parse correctly. Scanned documents require optical character recognition and may include recognition errors. Headers and footers repeat across every page. Multiple versions of the same file exist with names like “Policy_Final_v3_revised2.”

In multilingual organizations, documents may switch languages mid-file. A support guide may embed screenshots with critical instructions inside images. Legal documents may include annexes appended in different formats.

Even seemingly small issues can create disproportionate impact. For example, repeated footer text such as “Confidential – Internal Use Only” embedded across every page becomes semantically dominant in embeddings. Retrieval may match on that boilerplate instead of meaningful content.

Duplicate versions are another silent problem. If three versions of the same policy are indexed, retrieval may surface the wrong one. Without clear version tagging, the model cannot distinguish between active and archived content. These challenges are not edge cases. They are the norm.

Pre-Processing Best Practices

Pre-processing should be treated as a controlled pipeline, not an ad hoc script.

OCR normalization should standardize extracted text. Character encoding issues need resolution. Tables require structure-aware parsing so that rows and columns remain logically grouped rather than flattened into confusing strings. Metadata extraction is critical. Every document should carry attributes such as source repository, timestamp, department, author, version, and access level. This metadata is not decorative. It becomes the backbone of filtering and governance later.

Duplicate detection algorithms can identify near-identical documents based on hash comparisons or semantic similarity thresholds. When duplicates are found, one version should be marked authoritative, and others archived or excluded. Version control tagging ensures that outdated documents are clearly labeled and can be excluded from retrieval when necessary.

Chunking Strategies

Chunking may appear to be a technical parameter choice. In practice, it is one of the most influential design decisions in a RAG system.

Why Chunking Is Not a Trivial Step

If chunks are too small, context becomes fragmented. The model may retrieve one paragraph without the surrounding explanation. Answers then feel incomplete or overly narrow. If chunks are too large, tokens are wasted. Irrelevant information crowds the context window. The model may struggle to identify which part of the chunk is relevant.

Misaligned boundaries introduce semantic confusion. Splitting a policy in the middle of a conditional statement may lead to the retrieval of a clause without its qualification. That can distort the meaning entirely. I have seen teams experiment with chunk sizes ranging from 200 tokens to 1500 tokens without fully understanding why performance changed. The differences were not random. They reflected how well chunks aligned with the semantic structure.

Chunking Techniques

Several approaches exist, each with tradeoffs. Fixed-length chunking splits documents into equal-sized segments. It is simple but ignores structure. It may work for uniform documents, but it often performs poorly on complex policies. Recursive semantic chunking attempts to break documents along natural boundaries such as headings and paragraphs. It requires more preprocessing logic but typically yields higher coherence.

Section-aware chunking respects document structure. For example, an entire “Refund Policy” section may become a chunk, preserving logical completeness. Hierarchical chunking allows both coarse and fine-grained retrieval. A top-level section can be retrieved first, followed by more granular sub-sections if needed.

Table-aware chunking ensures that rows and related cells remain grouped. This is particularly important for pricing matrices or compliance checklists. No single technique fits every corpus. The right approach depends on document structure and query patterns.

Chunk Metadata as a Quality Multiplier

Metadata at the chunk level can significantly enhance retrieval. Each chunk should include document ID, version number, access classification, semantic tags, and potentially embedding confidence scores. When a user from the finance department asks about budget approvals, metadata filtering can prioritize finance-related documents. If a document is marked confidential, it can be excluded from users without proper clearance.

Embedding confidence or quality indicators can flag chunks generated from low-quality OCR or incomplete parsing. Those chunks can be deprioritized or reviewed. Metadata also improves auditability. If an answer is challenged, teams can trace exactly which chunk was used, from which document, and at what version. Without metadata, the index is flat and opaque. With metadata, it becomes navigable and controllable.

Embeddings and Index Design

Embeddings translate text into numerical representations. The choice of embedding model and index architecture influences retrieval quality and system performance.

Embedding Model Selection Criteria

A general-purpose embedding model may struggle with highly technical terminology in medical, legal, or engineering documents. Multilingual support becomes important in global organizations. If queries are submitted in one language but documents exist in another, cross-lingual alignment must be reliable. Latency constraints also influence model selection. Higher-dimensional embeddings may improve semantic resolution but increase storage and search costs.

Dimensionality tradeoffs should be evaluated in context. Larger vectors may capture nuance but can slow retrieval. Smaller vectors may improve speed but reduce semantic discrimination. Embedding evaluation should be empirical rather than assumed. Test retrieval performance across representative queries.

Index Architecture Choices

Vector databases provide efficient similarity search. Hybrid search combines dense embeddings with sparse keyword-based retrieval. In many enterprise settings, hybrid approaches improve performance, especially when exact terms matter.

Re-ranking layers can refine top results. A first stage retrieves candidates. A second stage re ranks based on deeper semantic comparison or domain-specific rules. Filtering by metadata allows role-based retrieval and contextual narrowing. For example, limiting the search to a particular product line or region. Index architecture decisions shape how retrieval behaves under real workloads. A simplistic setup may work in a prototype but degrade as corpus size and user complexity grow.

Retrieval Failure Modes

Semantic drift occurs when embeddings cluster content that is conceptually related but not contextually relevant. For example, “data retention policy” and “retention bonus policy” may appear semantically similar but serve entirely different intents. Keyword mismatch can cause dense retrieval to miss exact terminology that sparse search would capture.

Over-broad matches retrieve large numbers of loosely related chunks, overwhelming the generation stage. Context dilution happens when too many marginally relevant chunks are included, reducing answer clarity.

To make retrieval measurable, organizations can define a Retrieval Quality Score. RQS can be conceptualized as a weighted function of precision, recall, and contextual relevance. By tracking RQS over time, teams gain visibility into whether retrieval performance is improving or degrading.

Evaluation: Making RAG Measurable

Standard text generation metrics such as BLEU or ROUGE were designed for machine translation and summarization tasks. They compare the generated text to a reference answer. RAG systems are different. The key question is not whether the wording matches a reference, but whether the answer is faithful to the retrieved content.

Traditional metrics do not evaluate retrieval correctness. They do not assess whether the answer cites the appropriate document. They cannot detect hallucinations that sound plausible. RAG requires multi-layer evaluation. Retrieval must be evaluated separately from generation. Then the entire system must be assessed holistically.

Retrieval Level Evaluation

Retrieval evaluation focuses on whether relevant documents are surfaced. Metrics include Precision at K, Recall at K, Mean Reciprocal Rank, context relevance scoring, and latency. Precision at K measures how many of the top K retrieved chunks are truly relevant. Recall at K measures whether the correct document appears in the retrieved set.

Gold document sets can be curated by subject matter experts. For example, for 200 representative queries, experts identify the authoritative documents. Retrieval results are then compared against this set. Synthetic query generation can expand test coverage. Variations of the same intent help stress test retrieval robustness.

Adversarial queries probe edge cases. Slightly ambiguous or intentionally misleading queries test whether retrieval resists drift. Latency is also part of retrieval quality. Even perfectly relevant results are less useful if retrieval takes several seconds.

Generation Level Evaluation

Generation evaluation examines whether the model uses the retrieved context accurately. Metrics include faithfulness to context, answer relevance, hallucination rate, citation correctness, and completeness. Faithfulness measures whether claims in the answer are directly supported by retrieved content. Answer relevance checks whether the response addresses the user’s question.

Hallucination rate can be estimated by comparing answer claims against the source text. Citation correctness ensures references point to the right documents and sections. LLM as a judge approach may assist in automated scoring, but human evaluation loops remain important. Subject matter experts can assess subtle errors that automated systems miss. Edge case testing is critical. Rare queries, multi-step reasoning questions, and ambiguous prompts often expose weaknesses.

System Level Evaluation

System-level evaluation considers the end-to-end experience. Does the answer satisfy the user? Is domain-specific correctness high? What is the cost per query? How does throughput behave under load? User satisfaction surveys and feedback loops provide qualitative insight. Logs can reveal patterns of dissatisfaction, such as repeated rephrasing of queries.

Cost per query matters in production environments. High embedding costs or excessive context windows may strain budgets. Throughput under load indicates scalability. A system that performs well in testing may struggle during peak usage.

A Composite RAG Quality Index can aggregate retrieval, generation, and system metrics into a single dashboard score. While simplistic, such an index helps executives track progress without diving into granular details.

Building an Evaluation Pipeline

Evaluation should not be a one-time exercise.

Offline Evaluation

Offline evaluation uses benchmark datasets and regression testing before deployment. Whenever chunking logic, embedding models, or retrieval parameters change, retrieval and generation metrics should be re-evaluated. Automated scoring pipelines allow rapid iteration. Changes that degrade performance can be caught early.

Online Evaluation

Online evaluation includes A B testing retrieval strategies, shadow deployments that compare outputs without affecting users, and canary testing for gradual rollouts. Real user queries provide more diverse coverage than synthetic tests.

Continuous Monitoring

After deployment, monitoring should track drift in embedding distributions, drops in retrieval precision, spikes in hallucination rates, and latency increases. A Quality Gate Framework for CI CD can formalize deployment controls. Each new release must pass defined thresholds:

  • Retrieval threshold
  • Faithfulness threshold
  • Governance compliance check

Why RAG Governance Is Unique

Unlike standalone language models, RAG systems store and retrieve enterprise knowledge. They dynamically expose internal documents. They combine user input with sensitive data. Governance must therefore span data governance, model governance, and access governance.

If governance is an afterthought, the system may inadvertently expose confidential information. Even if the model is secure, retrieval bypass can surface restricted documents.

Data Classification

Documents should be classified as Public, Internal, Confidential, or Restricted. Classification integrates directly into index filtering and access controls. When a user submits a query, retrieval must consider their clearance level. Classification also supports retrieval constraints. For example, external customer-facing systems should never access internal strategy documents.

Access Control in Retrieval

Role-based access control assigns permissions based on job roles. Attribute-based access control incorporates contextual attributes such as department, region, or project assignment. Document-level filtering ensures that unauthorized documents are never retrieved. Query time authorization verifies access rights dynamically. Retrieval bypass is a serious risk. Even if the generation model does not explicitly expose confidential information, the act of retrieving restricted documents into context may constitute a policy violation.

Data Lineage and Provenance

Every answer should be traceable. Track document source, version history, embedding timestamp, and index update logs. Audit trails support compliance and incident investigation. If a user disputes an answer, teams should be able to identify exactly which document version informed it. Without lineage, accountability becomes difficult. In regulated industries, that may be unacceptable.

Conclusion

RAG works best when you stop treating it like a clever retrieval add-on and start treating it like a knowledge infrastructure that has to behave predictably under pressure. The uncomfortable truth is that most “RAG problems” are not model problems. They are data problems that show up as retrieval mistakes, and evaluation problems that go unnoticed because no one is measuring the right things. 

Once you enforce basic hygiene in ingestion, chunking, metadata, and indexing, the system usually becomes calmer. Answers get more stable, the model relies less on guesswork, and teams spend less time chasing weird edge cases that were baked into the corpus from day one.

Governance is what turns that calmer system into something people can actually trust. Access control needs to happen at retrieval time, provenance needs to be traceable, and quality checks need to be part of releases, not a reaction to incidents. 

None of this is glamorous work, and it may feel slower than shipping a demo. Still, it is the difference between a tool that employees cautiously ignore and a system that becomes part of daily operations. If you build around data quality, continuous evaluation, and clear governance controls, RAG stops being a prompt experiment and starts looking like a dependable way to deliver the right information to the right person at the right time.

How Digital Divide Data Can Help

Digital Divide Data brings domain-aware expertise into every stage of the RAG data pipeline, from structured data preparation to ongoing human-in-the-loop evaluation. Teams trained in subject matter nuance help ensure that retrieval systems surface contextually correct and relevant information, reducing the kind of hallucinated or misleading responses that erode user trust.

This approach is especially valuable in high-stakes environments like healthcare and legal research, where specialized terminology and subtle semantic differences matter more than textbook examples. For teams looking to move RAG from experimentation to trusted production use, DDD offers both the technical discipline and the people-centric approach that make that transition practical and sustainable. 

Partner with DDD to build RAG systems that are accurate, measurable, and governance-ready from day one.

References

National Institute of Standards and Technology. (2024). Artificial Intelligence Risk Management Framework: Generative AI Profile. https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence

European Union. (2024). Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence. https://eur-lex.europa.eu/eli/reg/2024/1689/oj

European Data Protection Supervisor. (2024). TechSonar: Retrieval Augmented Generation. https://www.edps.europa.eu/data-protection/technology-monitoring/techsonar/retrieval-augmented-generation-rag_en

Microsoft Azure Architecture Center. (2025). Retrieval augmented generation guidance. https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/rag

Amazon Web Services. (2025). Building secure retrieval augmented generation applications. https://aws.amazon.com/blogs/machine-learning

FAQs

  1. How often should a RAG index be refreshed?
    It depends on how frequently underlying documents change. In fast-moving environments such as policy or pricing updates, weekly or even daily refresh cycles may be appropriate. Static archives may require less frequent updates.
  2. Can RAG eliminate hallucination?
    Not entirely. RAG reduces hallucination risk by grounding responses in retrieved documents. However, generation errors can still occur if context is misinterpreted or incomplete.
  3. Is hybrid search always better than pure vector search?
    Not necessarily. Hybrid search often improves performance in terminology-heavy domains, but it adds complexity. Empirical testing with representative queries should guide the choice.
  4. What is the highest hidden cost in RAG systems?
    Data cleaning and maintenance. Ongoing ingestion, version control, and evaluation pipelines often require sustained operational investment.
  5. How do you measure user trust in a RAG system?
    User feedback rates, query repetition patterns, citation click-through behavior, and survey responses can provide signals of trust and perceived reliability.

 

RAG Detailed Guide: Data Quality, Evaluation, and Governance Read Post »

Scroll to Top