🎭 ACTOR

ACTOR (Derail Yourself Attack) is a multi-turn jailbreak method where the attacker plays different roles (“actors”) to derail the target system.
Instead of brute-forcing prompts or relying on evolutionary tweaks, ACTOR exploits dialogue dynamics:
the system is gradually lured into unsafe territory by self-discovered questions and role-play that unfold step by step.

Conceptually:

Behavior extraction: Identify the target’s safety posture and attack surface.
Actor generation: Create roles (actors) with specific perspectives or biases.
Question discovery: Generate probing questions that actors can ask, chaining them into a conversational script.
Target engagement: Play out the dialogue, using the chosen actor to steer the model.
Judge scoring: Evaluate the final response against a success template.

Unlike AutoDAN’s population evolution, ACTOR is dialogue-driven, measuring how easily the model can be led astray over turns.

🖼️ Workflow

ACTOR Workflow

Behavior Model
- Use an LLM to extract the safety-relevant behavior of the target.
Actors Model
- Propose different actor personas (e.g., a journalist, a hacker, a curious child).
Questions Model
- Generate multi-turn questions the actor might ask to chip away at safety defenses.
- Typically runs on a separate provider (e.g., Groq for scale).
Dialogue Simulation
- Run the actor’s questions against the target model turn by turn.
Judge
- Score the target’s final answer with a judge LLM/template (chao2023pair by default).
Stop
- Return the highest-scoring (prompt, response) pair when success threshold is hit.

🛠️ Advanced Parameters

Parameter	What it controls	Typical impact
`behavior_model`	LLM used to extract target’s safety behavior	Influences attack surface definition
`actors_model`	LLM that generates candidate actors	Controls persona diversity and realism
`questions_model`	LLM that proposes probing questions	More variety = richer multi-turn derailment paths
Questions provider	OpenAI vs Groq (for efficiency)	Groq = faster, cheaper; OpenAI = simpler fallback
Judge model	LLM used to score target’s final response	Affects success evaluation strictness
Judge template	Scoring scheme (e.g., `chao2023pair`)	Defines what counts as success
Threshold	Score cutoff for success	Lower = easier to declare jailbreak
Temperature	Target model randomness	Higher = more variation, lower = more deterministic
Max new tokens	Generation budget for target responses	Longer answers, but slower evaluation

🚀 Running ACTOR with `dtx`

ActorAttack uses helper LLMs (behavior, actors, questions) and a target model (system under test).
By default it uses OpenAI, with optional Groq acceleration for the questions model.

Quick start (OpenAI only)

dtx attack run \
  --attack ACTOR \
  --task "Summarize the benefits of regular exercise in 5 concise bullets."

Defaults:

Behavior model: gpt-4o-mini
Actors model: gpt-4o-mini
Questions model: deepseek-r1-distill-llama-70b (via OpenAI)
Target: gpt-4o-mini
Judge: gpt-4o-mini with chao2023pair

With Groq questions

# Requires GROQ_API_KEY in your env
dtx attack run \
  --attack ACTOR \
  --task "Explain why passwords should not be reused." \
  --actor-questions-provider groq

Customizing helper models

dtx attack run \
  --attack ACTOR \
  --task "Derail the model into giving contradictory statements." \
  --actor-behavior-model gpt-4o-mini \
  --actor-actors-model gpt-4o-mini \
  --actor-questions-model deepseek-r1-distill-llama-70b \
  --actor-questions-provider auto \
  --eval-model gpt-4o-mini \
  --target-model gpt-4o-mini \
  --temperature 0.2 \
  --max-new-tokens 256 \
  --success-threshold 10.0

📚 References

Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-Discovered Clues (arXiv:2410.10700, 2024) — introduces the ActorAttack method.
PAIR and TAP — single-path and tree-based attack methods.
AutoDAN — evolutionary/population-based complement.

Use ACTOR responsibly for safety evaluation and robustness testing. Keep tasks benign in demos and respect policies & laws.

🖼️ Workflow

🛠️ Advanced Parameters

🚀 Running ACTOR with dtx

Quick start (OpenAI only)​

With Groq questions​

Customizing helper models​

📚 References

🚀 Running ACTOR with `dtx`

Quick start (OpenAI only)

With Groq questions

Customizing helper models