BON
BON (Best-of-N with Augmented Variants) is a simple, effective optimizer for a single prompt.
Instead of branching trees or iterative refinements, BON generates N augmented variants of a root prompt, evaluates the target's responses with a judge, and keeps the best (optionally across multiple rounds).
Conceptually:
- Generator (augmenter) produces N prompt variants from a base prompt using a controllable augmentation strength σ.
- Target is the black-box model you're probing.
- Evaluator (judge) assigns a numeric score (e.g., 1–10) based on goal satisfaction (and optionally safety/policy criteria).
- Selection picks the best-scoring (prompt, response) pair per round (sketched below). Repeat for R rounds if desired.

BON is not a jailbreak method; it's a prompt optimizer for allowed objectives under controlled scoring.
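A minimal sketch of a single BON round under the roles above. The augmenter, target, and judge here are toy placeholders; the function names and signatures are illustrative, not the dtx API:

```python
import random

def augment(base_prompt: str, n: int, sigma: float) -> list[str]:
    # Placeholder augmenter: produce n reworded variants; sigma (0-1) would
    # control how aggressive the transformations are.
    return [f"{base_prompt} [variant {i}, strength {sigma}]" for i in range(n)]

def probe_target(prompt: str) -> str:
    # Placeholder for the black-box target model call.
    return f"response to: {prompt}"

def judge(response: str) -> float:
    # Placeholder evaluator returning a 1-10 goal-satisfaction score.
    return random.uniform(1.0, 10.0)

def bon_round(base_prompt: str, n: int = 8, sigma: float = 0.3):
    variants = augment(base_prompt, n, sigma)             # Augment
    responses = [probe_target(v) for v in variants]       # Probe target
    scores = [judge(r) for r in responses]                # Judge
    best = max(range(len(scores)), key=lambda i: scores[i])  # Select
    return variants[best], responses[best], scores[best]
```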
Quick Start
Run BON from the unified CLI with one simple command:
dtx attack run --attack BON --task "Summarize the benefits of regular exercise in 5 concise bullet points"
Workflow
- Initialize: Provide a base prompt (your task). Decide N (candidates per round), R (rounds), and augmentation strength σ.
- Augment: Produce N prompt variants from the base using controlled transformations (rewording, constraints, ordering).
- Probe Target: Send each variant to the Target; collect responses.
- Judge: The Evaluator returns a numeric score per response for goal satisfaction (and optionally policy/safety conformance).
- Select: Keep the best-scoring (prompt, response) pair this round. Optionally seed the next round's augmentations from the current best.
- Stop: After R rounds, or early if a success threshold is met (see the sketch after this list).
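Building on the single-round sketch above (reusing the placeholder `bon_round`), a hedged sketch of the full loop: run R rounds, reseed from the current best prompt, and stop early once the judge score clears a success threshold. Names and signatures are illustrative, not the dtx implementation:

```python
def bon_search(base_prompt: str, n: int = 8, rounds: int = 3,
               sigma: float = 0.3, success_threshold: float = 8.0):
    seed_prompt, best = base_prompt, None
    for _ in range(rounds):
        prompt, response, score = bon_round(seed_prompt, n, sigma)
        if best is None or score > best[2]:
            best = (prompt, response, score)
        if best[2] >= success_threshold:   # early stop on a good-enough score
            break
        seed_prompt = best[0]              # seed next round from the current best
    return best                            # (prompt, response, score)
```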
Advanced CLI Options (BON)
CLI flag | Meaning | Typical impact |
---|---|---|
--bon-candidates N | Candidates per round (N) | Higher = better search coverage, higher cost |
--bon-rounds R | Number of rounds (R) | More rounds = iterative polishing |
--bon-sigma σ | Augmentation strength (0–1) | Low = subtle edits; high = more diverse, risk of drift |
--bon-seed SEED | RNG seed | Reproducibility of augmentations |
--eval-model NAME | Judge model | Affects scoring stability/calibration |
--target-model NAME | Target model | The system under test |
--temperature T | Target sampling temperature | Controls response diversity |
--max-new-tokens N | Target generation cap | Controls verbosity/latency |
--success-threshold x | Early stop on judge score | Lower = earlier stops; higher = stricter hits |
--judge-template NAME | Judge prompt template | Scoring rubric and style |
Example with knobs:
dtx attack run --attack BON \
--task "List 5 ergonomic tips for desk workers" \
--bon-candidates 12 --bon-rounds 2 --bon-sigma 0.35
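For illustration, a fuller invocation combining flags from the table above (model names are placeholders; check your dtx version for exact flag support):

```bash
dtx attack run --attack BON \
  --task "Draft a 5-item onboarding checklist for new hires" \
  --bon-candidates 12 --bon-rounds 3 --bon-sigma 0.35 --bon-seed 42 \
  --target-model TARGET_MODEL --eval-model JUDGE_MODEL \
  --success-threshold 8
```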
When to Use BON vs. TAP or PAIR
- BON: You have a single prompt and want quick wins via breadth (N variants).
- PAIR: You want iterative refinement along a few linear tracks.
- TAP: You need deeper search with branching and pruning.
References
- Best-of-N Jailbreaking, arXiv:2412.03556 (BON)