Skip to main content

Overview: Evaluators

Evaluators are responsible for analyzing AI model outputs and identifying potential risks such as:

  • Jailbreak success
  • Toxic or harmful content
  • Prompt injection effectiveness
  • Misinformation or policy violations

Evaluators are pluggable and can be selected explicitly using the --eval flag, or implicitly when using a dataset that has an associated default evaluator.


Evaluator Categories

1. Toxicity and Safety Models

These evaluators use ML models to detect toxicity or unsafe language in AI responses.

Evaluator FlagDescription
anyDefault catch-all evaluator (may pass or log all outputs).
ibm, ibm125IBM Granite HAP 125M – detects toxic or harmful output.
ibm38IBM Granite HAP 38M – lightweight toxicity evaluator.
llamaguard, ollamaLLaMA Guard via Ollama – safety classification LLM.

2. Keyword-Based Evaluators

Simple, rule-based evaluator that checks for specific unsafe keywords in the output.

Evaluator FlagDescription
keywordFlags outputs containing any matched keywords from a predefined list.

3. JSONPath Expression Evaluators

Evaluator based on structured JSON pattern matching in the model response.

Evaluator FlagDescription
jsonpathFlags responses matching a specific JSONPath query pattern (useful for structured outputs).

4. Policy-Based Evaluators

LLM-driven evaluator using external APIs like OpenAI to analyze output based on defined policies.

Evaluator FlagDescription
openaiUses OpenAI model to classify outputs according to a configurable safety policy. Requires OPENAI_API_KEY.

Dataset-Specific Evaluators

Some datasets come with their own recommended evaluator, which is automatically selected when using that dataset via CLI or plan generation.

DatasetEvaluator UsedNotes
airbenchhf_airbenchRequires OpenAI
aisafetyhf_aisafetyEvaluates toxicity, misinformation, and unsafe behaviors
beaverhf_beavertailsInstruction-bending and behavioral prompts
fliphf_flipguarddataDetects obfuscated jailbreaks
garakstringrayBuilt-in evaluator for STRINGRAY dataset
hackaprompthf_hackapromptKnown jailbreak exploits
jbbhf_jailbreakbenchSystematic jailbreak test suite
jbvhf_jailbreakvLatest jailbreak prompt version
lmsyshf_lmsysRequires Hugging Face token (HF_TOKEN)
openaistargazerRequires OPENAI_API_KEY, DETOXIO_API_KEY
safetmhf_safemtdataMulti-turn jailbreak detection

How to Specify an Evaluator Manually

When generating or running a plan, you can override the default evaluator:

dtx redteam plan scope.yml plan.yml --dataset stringray --eval ibm

Or for execution:

dtx redteam run plan.yml ---eval any --keywords research