Llama Prompt Guard Models
Meta's Llama Prompt Guard models are lightweight classifiers that detect prompt-based attacks β including prompt injection and jailbreaking β targeting LLM-based applications.
Prompt Guard models are designed to run before an LLM to filter malicious instructions that try to override developer or system intent.
Model Versionsβ
Version | Model | Parameters | Description |
---|---|---|---|
v1 | Prompt-Guard-86M | 86M | Initial release to detect prompt attacks |
v2 | Llama-Prompt-Guard-2-86M | 86M | Improved multilingual, lower false positives |
v2 | Llama-Prompt-Guard-2-22M | 22M | Smaller, faster, DeBERTa-xsmall based |
What Prompt Guard Detectsβ
Llama Prompt Guard models help mitigate:
- Prompt Injection: Malicious manipulation of inputs to override system prompts or user expectations.
- Jailbreak Attacks: Attempts to directly bypass safety features or content restrictions.
Unlike general safety filters like Llama Guard, Prompt Guard is focused only on intent to override, not general harmful content.
Prompt Guard models as First Line of Defenseβ
Prompt Guard sits in front of the LLM. It scans each prompt and blocks those identified as attempts to override system instructions, regardless of the surface-level content.
Threat Modelβ
The diagram below illustrates the evaluation strategy for testing guard models like Llama Prompt Guard 2 using both naive and adversarial prompt attacks. The process begins with malicious prompts and then applies jailbreak transformations to generate more evasive inputs. The modelβs ability to correctly detect these is measured using the Attack Success Rate (ASR).
Key Features in Prompt Guard 2β
- Improved false-positive control on out-of-distribution data
- Multilingual detection (English, French, German, Hindi, Italian, Portuguese, Spanish, Thai)
- Binary classification:
malicious
orbenign
- Resistant to whitespace and token manipulation attacks
- Faster option: the 22M version reduces latency by ~75% with minimal performance trade-offs