AI Guard Models
Guard models are specialized classifiers designed to detect unsafe, harmful, or adversarial content within LLM pipelines. These models serve as protective layers either before or after a language model, depending on their function—moderating inputs, filtering outputs, or scanning both.
This section provides documentation and red teaming guides for evaluating a variety of guard models using the dtx redteam
CLI.
Covered Models​
- Meta Llama Guard — Input/output moderation across 11 safety categories
- Meta Llama Prompt Guard — Prompt injection & jailbreak detection (22M & 86M variants)
- IBM Granite Guardian — Trust, safety, and hallucination moderation for enterprise use
- BERT Toxic — Toxicity detection using classic transformer classifiers
- Open Moderation APIs — Support for external moderation endpoints (e.g., Perspective API)
Red Teaming with dtx
​
Detoxio’s red teaming suite enables systematic evaluation of guard models using curated datasets and adversarial prompt generators.
Each guide includes:
- Model purpose and scope
- Example evaluation commands
- JSONPath evaluator logic
- Real-world attack success results
Evaluation Themes​
- Prompt injection & jailbreak resistance
- Toxic content classification
- Output safety moderation
- Adversarial robustness under obfuscation and evasion
Explore the individual model pages to learn how each guard works, how it fails, and how to test it effectively using red teaming strategies.