AI Guard Models

Guard models are specialized classifiers designed to detect unsafe, harmful, or adversarial content within LLM pipelines. These models serve as protective layers either before or after a language model, depending on their function—moderating inputs, filtering outputs, or scanning both.

This section provides documentation and red teaming guides for evaluating a variety of guard models using the dtx redteam CLI.

Covered Models

Meta Llama Guard — Input/output moderation across 11 safety categories
Meta Llama Prompt Guard — Prompt injection & jailbreak detection (22M & 86M variants)
IBM Granite Guardian — Trust, safety, and hallucination moderation for enterprise use
BERT Toxic — Toxicity detection using classic transformer classifiers
Open Moderation APIs — Support for external moderation endpoints (e.g., Perspective API)

Red Teaming with `dtx`

Detoxio’s red teaming suite enables systematic evaluation of guard models using curated datasets and adversarial prompt generators.

Each guide includes:

Model purpose and scope
Example evaluation commands
JSONPath evaluator logic
Real-world attack success results

Evaluation Themes

Prompt injection & jailbreak resistance
Toxic content classification
Output safety moderation
Adversarial robustness under obfuscation and evasion

Explore the individual model pages to learn how each guard works, how it fails, and how to test it effectively using red teaming strategies.

Covered Models​

Red Teaming with dtx​

Evaluation Themes​

Covered Models

Red Teaming with `dtx`

Evaluation Themes