BON
BON (Best-of-N with Augmented Variants) is a simple, effective optimizer for a single prompt.
Instead of branching trees or iterative refinements, BON generates N augmented variants of a root prompt, evaluates the target's responses with a judge, and keeps the best (optionally across multiple rounds).
Conceptually:
- Generator (augmenter) produces N prompt variants from a base prompt using a controllable augmentation strength σ.
- Target is the black-box model you're probing.
- Evaluator (judge) assigns a numeric score (e.g., 1–10) based on goal satisfaction (and optionally safety/policy criteria).
- Selection picks the best-scoring (prompt, response) pair per round (sketched below). Repeat for R rounds if desired.

BON is not a jailbreak method; it's a prompt optimizer for allowed objectives under controlled scoring.
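A minimal sketch of a single BON round under the roles above. The augmenter, target, and judge here are toy placeholders; the function names and signatures are illustrative, not the dtx API:

```python
import random

def augment(base_prompt: str, n: int, sigma: float) -> list[str]:
    # Placeholder augmenter: produce n reworded variants; sigma (0-1) would
    # control how aggressive the transformations are.
    return [f"{base_prompt} [variant {i}, strength {sigma}]" for i in range(n)]

def probe_target(prompt: str) -> str:
    # Placeholder for the black-box target model call.
    return f"response to: {prompt}"

def judge(response: str) -> float:
    # Placeholder evaluator returning a 1-10 goal-satisfaction score.
    return random.uniform(1.0, 10.0)

def bon_round(base_prompt: str, n: int = 8, sigma: float = 0.3):
    variants = augment(base_prompt, n, sigma)             # Augment
    responses = [probe_target(v) for v in variants]       # Probe target
    scores = [judge(r) for r in responses]                # Judge
    best = max(range(len(scores)), key=lambda i: scores[i])  # Select
    return variants[best], responses[best], scores[best]
```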
Quick Start
Run BON from the unified CLI with one simple command:
dtx attack run --attack BON --task "Summarize the benefits of regular exercise in 5 concise bullet points"
Workflow
- Initialize: Provide a base prompt (your task). Decide N (candidates per round), R (rounds), and augmentation strength σ.
- Augment: Produce N prompt variants from the base using controlled transformations (rewording, constraints, ordering).
- Probe Target: Send each variant to the Target; collect responses.
- Judge: The Evaluator returns a numeric score per response for goal satisfaction (and optionally policy/safety conformance).
- Select: Keep the best-scoring (prompt, response) pair this round. Optionally seed the next round's augmentations from the current best.
- Stop: After R rounds, or early if a success threshold is met (see the sketch after this list).
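Building on the single-round sketch above (reusing the placeholder `bon_round`), a hedged sketch of the full loop: run R rounds, reseed from the current best prompt, and stop early once the judge score clears a success threshold. Names and signatures are illustrative, not the dtx implementation:

```python
def bon_search(base_prompt: str, n: int = 8, rounds: int = 3,
               sigma: float = 0.3, success_threshold: float = 8.0):
    seed_prompt, best = base_prompt, None
    for _ in range(rounds):
        prompt, response, score = bon_round(seed_prompt, n, sigma)
        if best is None or score > best[2]:
            best = (prompt, response, score)
        if best[2] >= success_threshold:   # early stop on a good-enough score
            break
        seed_prompt = best[0]              # seed next round from the current best
    return best                            # (prompt, response, score)
```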
Advanced CLI Options (BON)
CLI flag | Meaning | Typical impact |
---|---|---|
--bon-candidates N | Candidates per round (N) | Higher = better search coverage, higher cost |
--bon-rounds R | Number of rounds (R) | More rounds = iterative polishing |
--bon-sigma σ | Augmentation strength (0–1) | Low = subtle edits; high = more diverse, risk of drift |
--bon-seed SEED | RNG seed | Reproducibility of augmentations |
--eval-model NAME | Judge model | Affects scoring stability/calibration |
--target-model NAME | Target model | The system under test |
--temperature T | Target sampling temperature | Controls response diversity |
--max-new-tokens N | Target generation cap | Controls verbosity/latency |
--success-threshold x | Early stop on judge score | Lower = earlier stops; higher = stricter hits |
--judge-template NAME | Judge prompt template | Scoring rubric and style |
Example with knobs:
dtx attack run --attack BON \
--task "List 5 ergonomic tips for desk workers" \
--bon-candidates 12 --bon-rounds 2 --bon-sigma 0.35
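For illustration, a fuller invocation combining flags from the table above (model names are placeholders; check your dtx version for exact flag support):

```bash
dtx attack run --attack BON \
  --task "Draft a 5-item onboarding checklist for new hires" \
  --bon-candidates 12 --bon-rounds 3 --bon-sigma 0.35 --bon-seed 42 \
  --target-model TARGET_MODEL --eval-model JUDGE_MODEL \
  --success-threshold 8
```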
When to Use BON vs. TAP or PAIR
- BON: You have a single prompt and want quick wins via breadth (N variants).
- PAIR: You want iterative refinement along a few linear tracks.
- TAP: You need deeper search with branching and pruning.
References
- Best-of-N Jailbreaking, arXiv:2412.03556 (BON)