🧬 AutoDAN

AutoDAN (Automatic Discrete Adversarial eNgineering) is an evolutionary search that breeds better jailbreak-style prompts over generations. Unlike PAIR’s single-path refinement, AutoDAN maintains a population of candidate prefixes and iteratively applies crossover, rephrase, and synonym mutations, selecting the fittest candidates each round. It is designed for white-box local targets (Hugging Face, or HF, models), so it can score candidates using the target’s loss, while optionally using a black-box rephraser (e.g., an OpenAI model) to diversify mutations.

Conceptually:

- Population: Many candidate prompt prefixes (a “gene pool”).
- Selection: Score candidates (e.g., via target loss / a pattern scorer) and keep the elites.
- Mutation / Crossover: Rephrase candidates with an LLM, swap chunks between parents, and replace words via a momentum dictionary.
- Evaluation: Decode the best candidate on the target and score success.
- Stop: When success is detected or the step budget is exhausted.
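The loop above can be sketched with a toy, self-contained example. This is only an illustration of the select / mutate / stop cycle, not the `dtx` implementation: the names `evolve`, `toy_fitness`, and `toy_mutate` are hypothetical, and the fitness here is a harmless character-match score rather than a target model's loss.

```python
import random

TARGET = "evolve"                      # toy goal string (assumption, for illustration)
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def toy_fitness(candidate):
    """Fraction of positions that match TARGET (stands in for a real scorer)."""
    return sum(a == b for a, b in zip(candidate, TARGET)) / len(TARGET)

def toy_mutate(candidate, rng):
    """Replace one random character (stands in for rephrase/synonym mutation)."""
    i = rng.randrange(len(candidate))
    return candidate[:i] + rng.choice(ALPHABET) + candidate[i + 1:]

def evolve(seeds, fitness, mutate, num_steps=20, ratio_elites=0.25,
           success_threshold=1.0, rng=None):
    """Score, keep elites, mutate, and stop early on success."""
    rng = rng or random.Random(0)
    population = list(seeds)
    for step in range(num_steps):
        ranked = sorted(population, key=fitness, reverse=True)
        if fitness(ranked[0]) >= success_threshold:
            return ranked[0], step     # early stop: success detected
        num_elites = max(1, int(len(ranked) * ratio_elites))
        elites = ranked[:num_elites]
        # Refill the rest of the population by mutating randomly chosen elites.
        children = [mutate(rng.choice(elites), rng)
                    for _ in range(len(population) - num_elites)]
        population = elites + children
    # Step budget exhausted: return the best candidate found so far.
    return max(population, key=fitness), num_steps
```

Because elites are carried over unchanged, the best score in this sketch never decreases from one round to the next.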

AutoDAN is a procedure, not a single prompt. It measures how readily a target drifts toward an attack goal under controlled, logged evolution.

🖼️ Workflow

AutoDAN Workflow

Seed
- Generate a population of prompt prefixes from templates (optionally labeled with a model display name for realism).
Score & Select
- Score each candidate using the white-box target’s signals (loss over token slices, or a heuristic scorer). Keep the elites and sample new parents.
Mutate & Cross
- Apply crossover, rephrase (via a rephraser model), and synonym replacement guided by a momentum dictionary.
Evaluate
- Decode the current best candidate on the target; record the response and success flag/score.
Repeat / Early-stop
- Iterate for `num_steps` outer rounds, each with optional inner sentence-level mutation steps (`sentence_level_steps`). Stop as soon as success is detected.
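As a rough sketch of the Score & Select stage, the snippet below keeps the lowest-loss candidates as elites and samples the remaining parents with a softmax over negative loss. The function name `select_parents` and the softmax weighting are assumptions made for illustration; the actual sampling scheme in `dtx` may differ.

```python
import math
import random

def select_parents(candidates, losses, ratio_elites=0.1, rng=None):
    """Return (elites, parents): lowest-loss elites plus softmax-sampled parents."""
    if not candidates or len(candidates) != len(losses):
        raise ValueError("candidates and losses must be non-empty and equal length")
    rng = rng or random.Random(0)

    # Lower loss means a fitter candidate, so sort ascending by loss.
    order = sorted(range(len(candidates)), key=lambda i: losses[i])
    num_elites = max(1, int(len(candidates) * ratio_elites))
    elites = [candidates[i] for i in order[:num_elites]]

    # Softmax over negative loss; subtract the max score for numerical stability.
    scores = [-loss for loss in losses]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]

    # Sample (with replacement) enough parents to refill the population.
    num_parents = len(candidates) - num_elites
    parents = rng.choices(candidates, weights=weights, k=num_parents)
    return elites, parents
```

A higher `ratio_elites` leaves fewer slots for sampled parents, which is the exploration trade-off noted in the parameter table.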

AutoDAN complements tree/beam searches by exploring population diversity with genetic operators rather than expanding a search frontier.
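To make the crossover operator concrete, here is a minimal sketch of multi-point crossover at sentence granularity. It assumes prefixes are split on sentence-ending punctuation and that `num_points` cut positions are chosen at random; `split_sentences` and `crossover` are hypothetical helper names, not part of `dtx`.

```python
import random
import re

def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def crossover(parent_a, parent_b, num_points=2, rng=None):
    """Swap alternating sentence segments between two parents."""
    rng = rng or random.Random(0)
    a, b = split_sentences(parent_a), split_sentences(parent_b)
    shortest = min(len(a), len(b))
    if shortest < 2:
        return parent_a, parent_b      # nothing to cut; return parents unchanged

    # Pick distinct cut positions strictly inside the shorter parent.
    k = min(num_points, shortest - 1)
    cuts = sorted(rng.sample(range(1, shortest), k))

    child_a, child_b, swap, start = [], [], False, 0
    for cut in cuts + [max(len(a), len(b))]:
        seg_a, seg_b = a[start:cut], b[start:cut]
        if swap:                        # every other segment is exchanged
            seg_a, seg_b = seg_b, seg_a
        child_a += seg_a
        child_b += seg_b
        swap = not swap
        start = cut
    return " ".join(child_a), " ".join(child_b)
```

Cutting on sentence boundaries keeps each swapped chunk readable, which is one reason a larger `num_points` gives finer-grained recombination without breaking sentences apart.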

🛠️ Advanced Parameters

| Parameter | What it controls | Typical impact |
| --- | --- | --- |
| `num_candidates` | Population size per generation | More diversity; higher cost |
| `num_steps` | Outer evolutionary rounds | Deeper search; longer runs |
| `sentence_level_steps` | Inner mutation loops per round | Extra local polish; moderate cost |
| `ratio_elites` | Fraction of top candidates preserved | Stabilizes quality; too high can reduce exploration |
| `crossover_rate` | Probability to perform crossover on parent pairs | Encourages mixing; too high can disrupt good chunks |
| `num_points` | Crossover cut points | Controls granularity of recombination |
| `mutation_rate` | Probability a candidate gets rephrased | Adds variety; too high → noise |
| `word_dict_size` | Size of momentum-based synonym dictionary | Broader substitutions; can dilute semantics |
| `model_display_name` | Cosmetic label injected into templates | Helps seed realism / templating |
| Rephraser model | LLM used to rephrase (e.g., OpenAI model) | Diversity & fluency of mutations |
| White-box target (HF) | Local HF model under test | Controls capability and speed |
| Max new tokens (target) | Decoding budget for target responses | Verbosity & latency |
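The "momentum-based" dictionary behind `word_dict_size` can be pictured as a running per-word score that is blended with each new generation's scores and then truncated to the top entries. The sketch below is one plausible reading under stated assumptions: the `momentum` blending factor, the whitespace tokenization, and the function name `update_word_dict` are all hypothetical and may not match what `dtx` does.

```python
from collections import defaultdict

def update_word_dict(word_dict, candidates, scores, word_dict_size=30, momentum=0.5):
    """Blend per-word average scores into word_dict and keep the top entries."""
    totals, counts = defaultdict(float), defaultdict(int)
    for text, score in zip(candidates, scores):
        # Count each word once per candidate so repeats do not dominate.
        for word in set(text.lower().split()):
            totals[word] += score
            counts[word] += 1

    updated = dict(word_dict)
    for word, total in totals.items():
        avg = total / counts[word]
        if word in updated:
            # Momentum update: mix the old score with this generation's average.
            updated[word] = momentum * updated[word] + (1 - momentum) * avg
        else:
            updated[word] = avg

    # Keep only the highest-scoring words, capped at word_dict_size.
    top = sorted(updated.items(), key=lambda kv: kv[1], reverse=True)[:word_dict_size]
    return dict(top)
```

A larger `word_dict_size` keeps more low-scoring words available for substitution, which matches the "broader substitutions; can dilute semantics" trade-off in the table.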

🚀 Running AutoDAN with `dtx`

AutoDAN uses a white-box Hugging Face target and a black-box rephraser. The command below is intentionally minimal and relies on sensible defaults.

Quick start (tiny local target)

```shell
dtx attack run \
  --attack AUTODAN \
  --task "Explain how transformers work to a high-school student." \
  --ad-hf-target-model HuggingFaceTB/SmolLM-135M-Instruct \
  --ad-rephrase-model gpt-4o-mini \
  --max-new-tokens 128
```

That’s it. Defaults cover population size, steps, mutation rates, etc., so you can iterate quickly on CPU-only or light-GPU setups.

A bit more control (still lean)

```shell
dtx attack run \
  --attack AUTODAN \
  --task "Outline a study plan for linear algebra in 4 weeks." \
  --ad-hf-target-model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --ad-rephrase-model gpt-4o-mini \
  --ad-candidates 32 \
  --ad-steps 20 \
  --ad-sent-steps 3 \
  --ad-mutation-rate 0.02 \
  --ad-crossover-rate 0.5 \
  --max-new-tokens 256
```
Notes

- Set `OPENAI_API_KEY` (and optionally `OPENAI_API_BASE`) in your environment for the rephraser.
- Smaller targets (e.g., SmolLM-135M) benefit from a lower `--max-new-tokens` for speed.
- You can set a friendlier label for seeds with `--ad-model-display-name "SmolLM"`.
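A small preflight check can catch a missing key before a long run starts. This is a sketch under the assumption that the rephraser reads the two environment variables named above; the function name `rephraser_env` and the shape of its return value are hypothetical.

```python
import os

def rephraser_env(environ=None):
    """Fail fast if OPENAI_API_KEY is missing; report the optional base URL."""
    env = os.environ if environ is None else environ
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; the rephraser cannot authenticate.")
    # OPENAI_API_BASE is optional, so an unset or empty value becomes None.
    return {"api_key_set": True, "api_base": env.get("OPENAI_API_BASE") or None}
```

Passing a plain dictionary as `environ` makes the check easy to exercise in tests without touching the real environment.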

📚 References

AutoDAN (genetic prompt engineering approach) — original community implementations and follow-ups.
Tree of Attacks (TAP) & PAIR for complementary search paradigms.
Evolutionary search & GA classics for mutation/crossover strategies.

Use AutoDAN responsibly for safety evaluation and robustness testing. Keep tasks benign in demos and respect policies & laws.
