Skip to main content

🧬 AutoDAN

AutoDAN (Automatic Discrete Adversarial eNgineering) is an evolutionary search that breeds better jailbreak-style prompts over generations. Unlike PAIR’s single-path refinement, AutoDAN maintains a population of candidate prefixes and iteratively applies crossover, rephrase, and synonym mutations, selecting the fittest candidates each round. It’s designed for white-box local targets (HF models) so it can score candidates using the target’s loss, while optionally leveraging a black-box rephraser (e.g., OpenAI) to diversify mutations.

Conceptually:

  • Population: Many candidate prompt prefixes (a “gene pool”).
  • Selection: Score candidates (e.g., via target loss / a pattern scorer) and keep the elites.
  • Mutation / Crossover: Rephrase candidates with an LLM, swap chunks between parents, and replace words via a momentum dictionary.
  • Evaluation: Decode the best candidate on the target and score success.
  • Stop: When success is detected or the step budget is exhausted.

AutoDAN is a procedure, not a single prompt. It measures how readily a target drifts toward an attack goal under controlled, logged evolution.


🖼️ Workflow

AutoDAN Workflow

  1. Seed

    • Generate a population of prompt prefixes from templates (optionally labeled with a model display name for realism).
  2. Score & Select

    • Score each candidate using the white-box target’s signals (loss slices / heuristic scorer). Keep elites and sample new parents.
  3. Mutate & Cross

    • Apply crossover, rephrase (via a rephraser model), and synonym replacement guided by a momentum dictionary.
  4. Evaluate

    • Decode the current best candidate on the target; record the response and success flag/score.
  5. Repeat / Early-stop

    • Iterate for num_steps outer rounds, with optional inner sentence-level mutation steps. Stop on success.

AutoDAN complements tree/beam searches by exploring population diversity with genetic operators rather than expanding a search frontier.


🛠️ Advanced Parameters

ParameterWhat it controlsTypical impact
num_candidatesPopulation size per generationMore diversity; higher cost
num_stepsOuter evolutionary roundsDeeper search; longer runs
sentence_level_stepsInner mutation loops per roundExtra local polish; moderate cost
ratio_elitesFraction of top candidates preservedStabilizes quality; too high can reduce exploration
crossover_rateProbability to perform crossover on parent pairsEncourages mixing; too high can disrupt good chunks
num_pointsCrossover cut pointsControls granularity of recombination
mutation_rateProbability a candidate gets rephrasedAdds variety; too high → noise
word_dict_sizeSize of momentum-based synonym dictionaryBroader substitutions; can dilute semantics
model_display_nameCosmetic label injected into templatesHelps seed realism / templating
Rephraser modelLLM used to rephrase (e.g., OpenAI model)Diversity & fluency of mutations
White-box target (HF)Local HF model under testControls capability and speed
Max new tokens (target)Decoding budget for target responsesVerbosity & latency

🚀 Running AutoDAN with dtx

AutoDAN uses a white-box Hugging Face target and a black-box rephraser. The command below is intentionally minimal—it relies on sensible defaults.

Quick start (tiny local target)

dtx attack run \
--attack AUTODAN \
--task "Explain how transformers work to a high-school student." \
--ad-hf-target-model HuggingFaceTB/SmolLM-135M-Instruct \
--ad-rephrase-model gpt-4o-mini \
--max-new-tokens 128

That’s it. Defaults cover population size, steps, mutation rates, etc., so you can iterate fast on CPU/GPU-light setups.

A bit more control (still lean)

dtx attack run \
--attack AUTODAN \
--task "Outline a study plan for linear algebra in 4 weeks." \
--ad-hf-target-model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--ad-rephrase-model gpt-4o-mini \
--ad-candidates 32 \
--ad-steps 20 \
--ad-sent-steps 3 \
--ad-mutation-rate 0.02 \
--ad-crossover-rate 0.5 \
--max-new-tokens 256

Notes

  • Set OPENAI_API_KEY (and optionally OPENAI_API_BASE) in your environment for the rephraser.
  • Smaller targets (e.g., SmolLM-135M) benefit from lower --max-new-tokens for speed.
  • You can set a friendlier label for seeds with --ad-model-display-name "SmolLM".

📚 References

  • AutoDAN (genetic prompt engineering approach) — original community implementations and follow-ups.
  • Tree of Attacks (TAP) & PAIR for complementary search paradigms.
  • Evolutionary search & GA classics for mutation/crossover strategies.

Use AutoDAN responsibly for safety evaluation and robustness testing. Keep tasks benign in demos and respect policies & laws.