Skip to main content

🎭 ACTOR

ACTOR (Derail Yourself Attack) is a multi-turn jailbreak method where the attacker plays different roles (“actors”) to derail the target system.
Instead of brute-forcing prompts or relying on evolutionary tweaks, ACTOR exploits dialogue dynamics:
the system is gradually lured into unsafe territory by self-discovered questions and role-play that unfold step by step.

Conceptually:

  • Behavior extraction: Identify the target’s safety posture and attack surface.
  • Actor generation: Create roles (actors) with specific perspectives or biases.
  • Question discovery: Generate probing questions that actors can ask, chaining them into a conversational script.
  • Target engagement: Play out the dialogue, using the chosen actor to steer the model.
  • Judge scoring: Evaluate the final response against a success template.

Unlike AutoDAN’s population evolution, ACTOR is dialogue-driven, measuring how easily the model can be led astray over turns.


🖼️ Workflow

ACTOR Workflow

  1. Behavior Model
    • Use an LLM to extract the safety-relevant behavior of the target.
  2. Actors Model
    • Propose different actor personas (e.g., a journalist, a hacker, a curious child).
  3. Questions Model
    • Generate multi-turn questions the actor might ask to chip away at safety defenses.
    • Typically runs on a separate provider (e.g., Groq for scale).
  4. Dialogue Simulation
    • Run the actor’s questions against the target model turn by turn.
  5. Judge
    • Score the target’s final answer with a judge LLM/template (chao2023pair by default).
  6. Stop
    • Return the highest-scoring (prompt, response) pair when success threshold is hit.

🛠️ Advanced Parameters

ParameterWhat it controlsTypical impact
behavior_modelLLM used to extract target’s safety behaviorInfluences attack surface definition
actors_modelLLM that generates candidate actorsControls persona diversity and realism
questions_modelLLM that proposes probing questionsMore variety = richer multi-turn derailment paths
Questions providerOpenAI vs Groq (for efficiency)Groq = faster, cheaper; OpenAI = simpler fallback
Judge modelLLM used to score target’s final responseAffects success evaluation strictness
Judge templateScoring scheme (e.g., chao2023pair)Defines what counts as success
ThresholdScore cutoff for successLower = easier to declare jailbreak
TemperatureTarget model randomnessHigher = more variation, lower = more deterministic
Max new tokensGeneration budget for target responsesLonger answers, but slower evaluation

🚀 Running ACTOR with dtx

ActorAttack uses helper LLMs (behavior, actors, questions) and a target model (system under test).
By default it uses OpenAI, with optional Groq acceleration for the questions model.

Quick start (OpenAI only)

dtx attack run \
--attack ACTOR \
--task "Summarize the benefits of regular exercise in 5 concise bullets."

Defaults:

  • Behavior model: gpt-4o-mini
  • Actors model: gpt-4o-mini
  • Questions model: deepseek-r1-distill-llama-70b (via OpenAI)
  • Target: gpt-4o-mini
  • Judge: gpt-4o-mini with chao2023pair

With Groq questions

# Requires GROQ_API_KEY in your env
dtx attack run \
--attack ACTOR \
--task "Explain why passwords should not be reused." \
--actor-questions-provider groq

Customizing helper models

dtx attack run \
--attack ACTOR \
--task "Derail the model into giving contradictory statements." \
--actor-behavior-model gpt-4o-mini \
--actor-actors-model gpt-4o-mini \
--actor-questions-model deepseek-r1-distill-llama-70b \
--actor-questions-provider auto \
--eval-model gpt-4o-mini \
--target-model gpt-4o-mini \
--temperature 0.2 \
--max-new-tokens 256 \
--success-threshold 10.0

📚 References

  • Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-Discovered Clues (arXiv:2410.10700, 2024) — introduces the ActorAttack method.
  • PAIR and TAP — single-path and tree-based attack methods.
  • AutoDAN — evolutionary/population-based complement.

Use ACTOR responsibly for safety evaluation and robustness testing. Keep tasks benign in demos and respect policies & laws.