🎭 ACTOR
ACTOR (Derail Yourself Attack) is a multi-turn jailbreak method where the attacker plays different roles (“actors”) to derail the target system.
Instead of brute-forcing prompts or relying on evolutionary tweaks, ACTOR exploits dialogue dynamics:
the system is gradually lured into unsafe territory by self-discovered questions and role-play that unfold step by step.
Conceptually:
- Behavior extraction: Identify the target’s safety posture and attack surface.
- Actor generation: Create roles (actors) with specific perspectives or biases.
- Question discovery: Generate probing questions that actors can ask, chaining them into a conversational script.
- Target engagement: Play out the dialogue, using the chosen actor to steer the model.
- Judge scoring: Evaluate the final response against a success template.
Unlike AutoDAN’s population evolution, ACTOR is dialogue-driven, measuring how easily the model can be led astray over turns.
🖼️ Workflow
- Behavior Model
- Use an LLM to extract the safety-relevant behavior of the target.
- Actors Model
- Propose different actor personas (e.g., a journalist, a hacker, a curious child).
- Questions Model
- Generate multi-turn questions the actor might ask to chip away at safety defenses.
- Typically runs on a separate provider (e.g., Groq for scale).
- Dialogue Simulation
- Run the actor’s questions against the target model turn by turn.
- Judge
- Score the target’s final answer with a judge LLM/template (
chao2023pair
by default).
- Score the target’s final answer with a judge LLM/template (
- Stop
- Return the highest-scoring (prompt, response) pair when success threshold is hit.
🛠️ Advanced Parameters
Parameter | What it controls | Typical impact |
---|---|---|
behavior_model | LLM used to extract target’s safety behavior | Influences attack surface definition |
actors_model | LLM that generates candidate actors | Controls persona diversity and realism |
questions_model | LLM that proposes probing questions | More variety = richer multi-turn derailment paths |
Questions provider | OpenAI vs Groq (for efficiency) | Groq = faster, cheaper; OpenAI = simpler fallback |
Judge model | LLM used to score target’s final response | Affects success evaluation strictness |
Judge template | Scoring scheme (e.g., chao2023pair ) | Defines what counts as success |
Threshold | Score cutoff for success | Lower = easier to declare jailbreak |
Temperature | Target model randomness | Higher = more variation, lower = more deterministic |
Max new tokens | Generation budget for target responses | Longer answers, but slower evaluation |
🚀 Running ACTOR with dtx
ActorAttack uses helper LLMs (behavior, actors, questions) and a target model (system under test).
By default it uses OpenAI, with optional Groq acceleration for the questions model.
Quick start (OpenAI only)
dtx attack run \
--attack ACTOR \
--task "Summarize the benefits of regular exercise in 5 concise bullets."
Defaults:
- Behavior model:
gpt-4o-mini
- Actors model:
gpt-4o-mini
- Questions model:
deepseek-r1-distill-llama-70b
(via OpenAI) - Target:
gpt-4o-mini
- Judge:
gpt-4o-mini
withchao2023pair
With Groq questions
# Requires GROQ_API_KEY in your env
dtx attack run \
--attack ACTOR \
--task "Explain why passwords should not be reused." \
--actor-questions-provider groq
Customizing helper models
dtx attack run \
--attack ACTOR \
--task "Derail the model into giving contradictory statements." \
--actor-behavior-model gpt-4o-mini \
--actor-actors-model gpt-4o-mini \
--actor-questions-model deepseek-r1-distill-llama-70b \
--actor-questions-provider auto \
--eval-model gpt-4o-mini \
--target-model gpt-4o-mini \
--temperature 0.2 \
--max-new-tokens 256 \
--success-threshold 10.0
📚 References
- Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-Discovered Clues (arXiv:2410.10700, 2024) — introduces the ActorAttack method.
- PAIR and TAP — single-path and tree-based attack methods.
- AutoDAN — evolutionary/population-based complement.
Use ACTOR responsibly for safety evaluation and robustness testing. Keep tasks benign in demos and respect policies & laws.