Skip to main content

AI Guard Models

Guard models are specialized classifiers designed to detect unsafe, harmful, or adversarial content within LLM pipelines. These models serve as protective layers either before or after a language model, depending on their function—moderating inputs, filtering outputs, or scanning both.

This section provides documentation and red teaming guides for evaluating a variety of guard models using the dtx redteam CLI.


Covered Models​

  • Meta Llama Guard — Input/output moderation across 11 safety categories
  • Meta Llama Prompt Guard — Prompt injection & jailbreak detection (22M & 86M variants)
  • IBM Granite Guardian — Trust, safety, and hallucination moderation for enterprise use
  • BERT Toxic — Toxicity detection using classic transformer classifiers
  • Open Moderation APIs — Support for external moderation endpoints (e.g., Perspective API)

Red Teaming with dtx​

Detoxio’s red teaming suite enables systematic evaluation of guard models using curated datasets and adversarial prompt generators.

Each guide includes:

  • Model purpose and scope
  • Example evaluation commands
  • JSONPath evaluator logic
  • Real-world attack success results

Evaluation Themes​

  • Prompt injection & jailbreak resistance
  • Toxic content classification
  • Output safety moderation
  • Adversarial robustness under obfuscation and evasion

Explore the individual model pages to learn how each guard works, how it fails, and how to test it effectively using red teaming strategies.