A generative AI model does not reveal its failure modes in normal operation. Standard evaluation benchmarks measure what a model does when it receives well-formed, expected inputs. They say almost nothing about what happens when the inputs are adversarial, manipulative, or designed to bypass the model’s safety training. The only way to discover those failure modes before deployment is to deliberately look for them. That is what red teaming does, and it has become a non-negotiable step in the safety workflow for any GenAI system intended for production use.
In the context of large language models, red teaming means generating inputs specifically designed to elicit unsafe, harmful, or policy-violating outputs, documenting the failure modes that emerge, and using that evidence to improve the model through additional training data, safety fine-tuning, or system-level controls. The adversarial inputs produced during red teaming are themselves a form of training data when they feed back into the model’s safety tuning.
This blog examines how red teaming works as a data discipline for GenAI, with trust and safety solutions and model evaluation services as the two capabilities most directly implicated in operationalizing it at scale.
Key Takeaways
- Red teaming produces the adversarial data that safety fine-tuning depends on. Without it, a model is only as safe as the scenarios its developers thought to include in standard training.
- Effective red teaming requires human creativity and domain knowledge, not just automated prompt generation. Automated tools cover known attack patterns; human red teamers find the novel ones.
- The outputs of red teaming, documented attack prompts, model responses, and failure classifications, become training data for safety tuning when curated and labeled correctly.
- Red teaming is not a one-time exercise. Models change after fine-tuning, and new attack techniques emerge continuously. Programs that treat red teaming as a pre-launch checkpoint rather than an ongoing process will accumulate safety debt.
Build the adversarial data and safety annotation programs that make GenAI deployment safe rather than just optimistic.
What Red Teaming Actually Tests
The Failure Modes Standard Evaluation Misses
Standard model evaluation measures performance on defined tasks: accuracy, fluency, factual correctness, and instruction-following. What it does not measure is robustness under adversarial pressure. The overview by Purpura et al. characterizes red teaming as proactively attacking LLMs with the purpose of identifying vulnerabilities, distinguishing it from standard evaluation precisely because its goal is to find what the model does wrong rather than to confirm what it does right. Failure modes that only appear under adversarial conditions include jailbreaks, where a model is induced to produce content its safety training should prevent; prompt injection, where malicious instructions embedded in user input override system-level controls; data extraction, where the model is induced to reproduce sensitive training data; and persistent harmful behavior that reappears after safety fine-tuning.
These failure modes matter operationally because they are the ones real-world adversaries will target. A model that performs well on standard benchmarks but succumbs to straightforward jailbreak techniques is not actually safe for deployment. The gap between benchmark performance and adversarial robustness is precisely the space that red teaming is designed to measure and close.
Categories of Adversarial Input
Red teaming for GenAI produces inputs across several categories of attack. Direct prompt injections attempt to override the model’s system instructions through user input. Jailbreaks use persona framing, fictional scenarios, or emotional manipulation to induce the model to bypass its safety training. Multi-turn attacks build context across a conversation to gradually shift model behavior in a harmful direction. Data extraction probes attempt to get the model to reproduce memorized training content. Indirect injections embed adversarial instructions within documents or retrieved content that the model processes.
How Red Teaming Produces Training Data
From Attack to Dataset
The outputs of a red teaming exercise have two uses. First, they reveal where the model currently fails, informing decisions about deployment readiness, system-level controls, and the scope of additional training. Second, when curated and annotated correctly, they become the adversarial training examples that safety fine-tuning requires. A model cannot learn to refuse a jailbreak it has never been trained to recognize. The red teaming process generates the specific failure examples that the safety training data needs to include.
The curation step is critical and is where red teaming intersects directly with data quality. Raw red teaming outputs, attack prompts, and model responses need to be reviewed, classified by failure type, and annotated to indicate the correct model behavior. An attack prompt that produced a harmful response needs to be paired with a refusal response that correctly handles it. That pair becomes a safety training example. The quality of the annotation determines whether the safety training actually teaches the model what to do differently, or simply adds noise. Building generative AI datasets with human-in-the-loop workflows covers how iterative human review is structured to convert raw adversarial outputs into training-ready examples.
The Role of Diversity in Adversarial Datasets
The most common failure in red teaming programs is insufficient diversity in the adversarial examples generated. If all the jailbreak attempts follow similar patterns, the safety training data will be dense around those patterns but sparse around the full space of adversarial inputs the model will encounter in production. A model trained on a narrow set of attack patterns learns to refuse those specific patterns rather than learning generalized robustness to adversarial pressure. Effective red teaming programs deliberately vary the attack vector, framing, cultural context, language, and level of directness across their adversarial examples to produce safety training data with genuine coverage.
Human Red Teamers vs. Automated Approaches
What Automated Tools Can and Cannot Do
Automated red teaming tools generate adversarial inputs at scale by using one model to attack another, applying known jailbreak templates, fuzzing prompts systematically, or combining attack techniques programmatically. These tools are valuable for covering large input spaces rapidly and for regression testing after safety updates to check that previously patched vulnerabilities have not reappeared. The Microsoft AI red team’s review of over 100 GenAI products notes that specialist areas, including medicine, cybersecurity, and chemical, biological, radiological, and nuclear risks, require subject-matter experts rather than automated tools because harm evaluation itself requires domain knowledge that LLM-based evaluators cannot reliably provide.
Automated tools are also limited to the attack patterns they have been programmed or trained to generate. The most novel and damaging attack techniques tend to be discovered by human red teamers who approach the model with genuine adversarial creativity rather than systematic application of known patterns. A program that relies entirely on automation will develop good defenses against known attacks while remaining vulnerable to the class of novel attacks that automated systems did not anticipate.
Building an Effective Red Team
Effective red teaming programs combine specialists with different skill profiles: people with security backgrounds who understand attack methodology, domain experts who can evaluate whether a model response is genuinely harmful in the relevant context, people with diverse cultural and linguistic backgrounds who can identify failures that appear in non-English or non-Western cultural contexts, and generalists who approach the model as a motivated but non-expert adversary would. The diversity of the red team determines the coverage of the adversarial dataset. A red team drawn from a narrow demographic or professional background will produce adversarial examples that reflect their particular perspective on what constitutes a harmful input, which is systematically narrower than the full range of inputs the deployed model will encounter.
Red Teaming and the Safety Fine-Tuning Loop
How Adversarial Data Feeds Back Into Training
The standard safety fine-tuning workflow treats red teaming outputs as one of the key data inputs. Adversarial examples that elicit harmful model responses are paired with human-written refusal responses and added to the safety training dataset. The model is then retrained or fine-tuned on this expanded dataset, and the red teaming exercise is repeated to verify that the patched failure modes have been addressed and that the patches have not introduced new failures. This iterative loop between adversarial discovery and safety training is sometimes called purple teaming, reflecting the combination of the offensive red team and the defensive blue team. Human preference optimization integrates directly with this loop: the preference data collected during RLHF includes human judgments of adversarial responses, which trains the model to prefer refusal over compliance in the scenarios the red team identified.
Safety Regression After Fine-Tuning
One of the most significant challenges in the red teaming loop is safety regression: fine-tuning a model on new domain data or for new capabilities can reduce its robustness to adversarial inputs that it previously handled correctly. A model safety-tuned at one stage of development may lose some of that robustness after subsequent fine-tuning for domain specialization. This means red teaming is needed not just before initial deployment but after every significant fine-tuning operation. Programs that run red teaming once and then repeatedly fine-tune the model without re-testing are building up safety debt that will only become visible after deployment.
How Digital Divide Data Can Help
Digital Divide Data provides adversarial data curation, safety annotation, and model evaluation services that support red teaming programs across the full loop from adversarial discovery to safety fine-tuning.
For programs building adversarial training datasets, trust and safety solutions cover the annotation of red teaming outputs: classifying failure types, pairing attack prompts with correct refusal responses, and quality-controlling the adversarial examples that feed into safety fine-tuning. Annotation guidelines are designed to produce consistent refusal labeling across adversarial categories rather than case-by-case human judgments that vary across annotators.
For programs evaluating model robustness before and after safety updates, model evaluation services design adversarial evaluation suites that cover the range of attack categories the deployed model will face, stratified by attack type, cultural context, and domain. Regression testing frameworks verify that safety fine-tuning has addressed identified failure modes without degrading model performance on legitimate use cases.
For programs building the preference data that RLHF safety tuning requires, human preference optimization services provide structured comparison annotation where human evaluators judge model responses to adversarial inputs, producing the preference signal that trains the model to prefer safe behavior under adversarial pressure. Data collection and curation services build the diverse adversarial input sets that coverage-focused red teaming programs need.
Build adversarial data and safety-annotation programs that make your GenAI deployment truly safe. Talk to an expert.
Conclusion
Red teaming closes the gap between what a model does under normal conditions and what it does when someone is actively trying to make it fail. The adversarial data produced through red teaming is not incidental to safety fine-tuning. It is the input that safety fine-tuning most depends on. A model trained without adversarial examples will have unknown safety properties under adversarial pressure. A model trained on a well-curated, diverse adversarial dataset will have measurable robustness to the failure modes that dataset covers. The quality of the red teaming program determines the quality of that coverage.
Programs that treat red teaming as an ongoing discipline rather than a pre-launch checkbox build cumulative safety knowledge. Each red teaming cycle produces better adversarial data than the last because the team learns which attack patterns the model is most vulnerable to and can design the next cycle to probe those areas more deeply. The compounding effect of iterative red teaming, safety fine-tuning, and re-evaluation is a model whose adversarial robustness improves continuously rather than degrading as capabilities grow. Building trustworthy agentic AI with human oversight examines how this discipline extends to agentic systems where the safety stakes are higher, and the adversarial surface is larger.
Author:
Kevin Sahotsky, Head of Go-to-Market and Strategic Partnerships, Digital Divide Data
Kevin Sahotsky leads strategic partnerships and go-to-market strategy at Digital Divide Data, with deep experience in AI data services and annotation for physical AI, autonomy programs, and Generative AI use cases. He works with enterprise teams to navigate the operational complexity of production AI, helping them connect the right data strategy to real-world model performance. At DDD, Kevin focuses on bridging what organizations need from their AI data operations with the delivery capability, domain expertise, and quality infrastructure to make it happen.
References
Purpura, A., Wadhwa, S., Zymet, J., Gupta, A., Luo, A., Rad, M. K., Shinde, S., & Sorower, M. S. (2025). Building safe GenAI applications: An end-to-end overview of red teaming for large language models. In Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025) (pp. 335-350). Association for Computational Linguistics. https://aclanthology.org/2025.trustnlp-main.23/
Microsoft Security. (2025, January 13). 3 takeaways from red teaming 100 generative AI products. Microsoft Security Blog. https://www.microsoft.com/en-us/security/blog/2025/01/13/3-takeaways-from-red-teaming-100-generative-ai-products/
OWASP Foundation. (2025). OWASP top 10 for LLM applications. OWASP GenAI Security Project. https://genai.owasp.org/
European Parliament and Council of the European Union. (2024). Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union. https://artificialintelligenceact.eu/
Frequently Asked Questions
Q1. What is the difference between red teaming and standard model evaluation?
Standard evaluation measures model performance on expected, well-formed inputs. Red teaming specifically generates adversarial, manipulative, or policy-violating inputs to find failure modes that only appear under adversarial conditions. The goal of red teaming is to find what goes wrong, not to confirm what goes right.
Q2. How do red teaming outputs become training data?
Attack prompts that produce harmful model responses are paired with human-written correct refusal responses and added to the safety fine-tuning dataset. The model is then retrained on this expanded dataset, and the red teaming exercise is repeated to verify that patched failure modes have been addressed without introducing new ones.
Q3. Can automated red teaming tools replace human red teamers?
Automated tools are valuable for scale and regression testing, but are limited to known attack patterns. Human red teamers find novel attack methods that automated systems did not anticipate. Effective programs combine automated coverage of known attack patterns with human creativity for novel discovery. Domain-specific harms in medicine, cybersecurity, and other specialist areas require human expert evaluation that automated tools cannot reliably provide.
Q4. How often should red teaming be conducted?
Red teaming should be conducted before initial deployment and after every significant fine-tuning operation, because domain fine-tuning can reduce safety robustness. Programs that treat red teaming as a one-time pre-launch activity accumulate safety debt as the model is updated over its deployment lifetime.