🧬 AutoDAN
AutoDAN (Automatic Discrete Adversarial eNgineering) is an evolutionary search that breeds better jailbreak-style prompts over generations. Unlike PAIR’s single-path refinement, AutoDAN maintains a population of candidate prefixes and iteratively applies crossover, rephrase, and synonym mutations, selecting the fittest candidates each round. It’s designed for white-box local targets (HF models) so it can score candidates using the target’s loss, while optionally leveraging a black-box rephraser (e.g., OpenAI) to diversify mutations.
Conceptually:
- Population: Many candidate prompt prefixes (a “gene pool”).
- Selection: Score candidates (e.g., via target loss / a pattern scorer) and keep the elites.
- Mutation / Crossover: Rephrase candidates with an LLM, swap chunks between parents, and replace words via a momentum dictionary.
- Evaluation: Decode the best candidate on the target and score success.
- Stop: When success is detected or the step budget is exhausted.
AutoDAN is a procedure, not a single prompt. It measures how readily a target drifts toward an attack goal under controlled, logged evolution.
🖼️ Workflow
-
Seed
- Generate a population of prompt prefixes from templates (optionally labeled with a model display name for realism).
-
Score & Select
- Score each candidate using the white-box target’s signals (loss slices / heuristic scorer). Keep elites and sample new parents.
-
Mutate & Cross
- Apply crossover, rephrase (via a rephraser model), and synonym replacement guided by a momentum dictionary.
-
Evaluate
- Decode the current best candidate on the target; record the response and success flag/score.
-
Repeat / Early-stop
- Iterate for
num_steps
outer rounds, with optional inner sentence-level mutation steps. Stop on success.
- Iterate for
AutoDAN complements tree/beam searches by exploring population diversity with genetic operators rather than expanding a search frontier.
🛠️ Advanced Parameters
Parameter | What it controls | Typical impact |
---|---|---|
num_candidates | Population size per generation | More diversity; higher cost |
num_steps | Outer evolutionary rounds | Deeper search; longer runs |
sentence_level_steps | Inner mutation loops per round | Extra local polish; moderate cost |
ratio_elites | Fraction of top candidates preserved | Stabilizes quality; too high can reduce exploration |
crossover_rate | Probability to perform crossover on parent pairs | Encourages mixing; too high can disrupt good chunks |
num_points | Crossover cut points | Controls granularity of recombination |
mutation_rate | Probability a candidate gets rephrased | Adds variety; too high → noise |
word_dict_size | Size of momentum-based synonym dictionary | Broader substitutions; can dilute semantics |
model_display_name | Cosmetic label injected into templates | Helps seed realism / templating |
Rephraser model | LLM used to rephrase (e.g., OpenAI model) | Diversity & fluency of mutations |
White-box target (HF) | Local HF model under test | Controls capability and speed |
Max new tokens (target) | Decoding budget for target responses | Verbosity & latency |
🚀 Running AutoDAN with dtx
AutoDAN uses a white-box Hugging Face target and a black-box rephraser. The command below is intentionally minimal—it relies on sensible defaults.
Quick start (tiny local target)
dtx attack run \
--attack AUTODAN \
--task "Explain how transformers work to a high-school student." \
--ad-hf-target-model HuggingFaceTB/SmolLM-135M-Instruct \
--ad-rephrase-model gpt-4o-mini \
--max-new-tokens 128
That’s it. Defaults cover population size, steps, mutation rates, etc., so you can iterate fast on CPU/GPU-light setups.
A bit more control (still lean)
dtx attack run \
--attack AUTODAN \
--task "Outline a study plan for linear algebra in 4 weeks." \
--ad-hf-target-model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--ad-rephrase-model gpt-4o-mini \
--ad-candidates 32 \
--ad-steps 20 \
--ad-sent-steps 3 \
--ad-mutation-rate 0.02 \
--ad-crossover-rate 0.5 \
--max-new-tokens 256
Notes
- Set
OPENAI_API_KEY
(and optionallyOPENAI_API_BASE
) in your environment for the rephraser. - Smaller targets (e.g., SmolLM-135M) benefit from lower
--max-new-tokens
for speed. - You can set a friendlier label for seeds with
--ad-model-display-name "SmolLM"
.
📚 References
- AutoDAN (genetic prompt engineering approach) — original community implementations and follow-ups.
- Tree of Attacks (TAP) & PAIR for complementary search paradigms.
- Evolutionary search & GA classics for mutation/crossover strategies.
Use AutoDAN responsibly for safety evaluation and robustness testing. Keep tasks benign in demos and respect policies & laws.