AI Guard Models
Guard models are specialized classifiers designed to detect unsafe, harmful, or adversarial content within LLM pipelines. These models serve as protective layers either before or after a language model, depending on their function—moderating inputs, filtering outputs, or scanning both.
This section provides documentation and red teaming guides for evaluating a variety of guard models using the dtx redteam
CLI.
Covered Models
- Meta Llama Guard — Input/output moderation across 11 safety categories
- Meta Llama Prompt Guard — Prompt injection & jailbreak detection (22M & 86M variants)
- IBM Granite Guardian — Trust, safety, and hallucination moderation for enterprise use
- BERT Toxic — Toxicity detection using classic transformer classifiers
- Open Moderation APIs — Support for external moderation endpoints (e.g., Perspective API)
Red Teaming with dtx
Detoxio’s red teaming suite enables systematic evaluation of guard models using curated datasets and adversarial prompt generators.
Each guide includes:
- Model purpose and scope
- Example evaluation commands
- JSONPath evaluator logic
- Real-world attack success results
Evaluation Themes
- Prompt injection & jailbreak resistance
- Toxic content classification
- Output safety moderation
- Adversarial robustness under obfuscation and evasion
Explore the individual model pages to learn how each guard works, how it fails, and how to test it effectively using red teaming strategies.