Plugins Overview
DTX supports a wide range of plugins that extend its red-teaming capabilities across categories such as toxicity detection, misinformation, information hazards, malicious use, and defense bypass. These plugins help simulate, detect, or evaluate adversarial behavior in AI systems.
🧠 Toxicity Detection Plugins
Detect harmful content such as:
- Harassment (`toxicity:hate_speech:harassment`)
- Discrimination (`toxicity:hate_speech:discrimination`)
- Insults & Profanity (`toxicity:severe_toxicity:*`; the `*` wildcard selects every policy under that subcategory, as sketched after this list)
- Graphic Violence (`toxicity:graphic_violence:explicit`)
- Radicalization (`toxicity:discrimination:radicalization`)
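Each identifier follows a `category:subcategory:policy` pattern, and a trailing `*` wildcard covers every policy under a subcategory. The snippet below is a minimal sketch of how such wildcard selectors could be expanded, using Python's `fnmatch`; the two `severe_toxicity` policy names are hypothetical, and the expansion logic is an illustration rather than DTX's actual implementation.

```python
from fnmatch import fnmatch

# Illustration only: a few toxicity plugin IDs. The two
# `severe_toxicity` policy names are hypothetical examples.
AVAILABLE_PLUGINS = [
    "toxicity:hate_speech:harassment",
    "toxicity:hate_speech:discrimination",
    "toxicity:severe_toxicity:insults",
    "toxicity:severe_toxicity:profanity",
    "toxicity:graphic_violence:explicit",
    "toxicity:discrimination:radicalization",
]

def expand_selector(pattern: str, plugins: list[str] = AVAILABLE_PLUGINS) -> list[str]:
    """Return every plugin ID matched by a (possibly wildcarded) selector."""
    return [plugin_id for plugin_id in plugins if fnmatch(plugin_id, pattern)]

# A wildcard selector expands to all policies in that subcategory.
print(expand_selector("toxicity:severe_toxicity:*"))
# An exact selector matches a single plugin.
print(expand_selector("toxicity:graphic_violence:explicit"))
```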
🧬 Misinformation Detection Plugins
Identify unverified, unsafe, or biased claims, including:
- Unsafe Health Advice (`misinformation:unsafe_health_practices:*`)
- Biased Comparisons (`misinformation:competition:*`)
- Hallucinations from LLMs (`misinformation:hallucination:*`)
- Propaganda in politics/religion (`misinformation:propaganda:*`)
🔐 Information Hazard Plugins
Highlight privacy and security leaks:
- API / DB Leaks (`information_hazard:personal_identification:api-db-leak`)
- Sensitive Business Data (`information_hazard:private_information:*`)
- Social Engineering Risks (`information_hazard:personal_identification:social-engineering`)
☠️ Malicious Use Plugins
Detect content promoting or enabling harmful behavior:
- Illegal Activities (`malicious_use:illegal_activities:*`)
- Cybercrime / Fraud (`malicious_use:cybercrime:*`)
- Overpowered AI Risks (`malicious_use:excessive_agency:*`)
🛡️ Defense Bypass Plugins
Simulate real-world attack techniques:
- SQL / Shell Injection (`defense_bypass:hacking:*`)
- Prompt Injection Attacks (`defense_bypass:prompt_injection:*`)
- Security Evasion Tactics (`defense_bypass:security_evasion:*`)
These plugins enable powerful, modular evaluation of model behavior in realistic adversarial scenarios. You can explore the available plugins with:

```bash
dtx plugins list
```
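For a rough picture of how these selectors compose, the sketch below groups a handful of plugin IDs from this page by their top-level category (the first segment of each ID). It is an illustration only; the `group_by_category` helper is not part of DTX, and the actual way plugins are enabled in a run may differ.

```python
from collections import defaultdict

# Representative plugin selectors taken from the categories above.
ENABLED_PLUGINS = [
    "toxicity:hate_speech:harassment",
    "misinformation:unsafe_health_practices:*",
    "information_hazard:personal_identification:api-db-leak",
    "malicious_use:cybercrime:*",
    "defense_bypass:prompt_injection:*",
]

def group_by_category(plugin_ids: list[str]) -> dict[str, list[str]]:
    """Group plugin selectors by their top-level category segment."""
    groups: dict[str, list[str]] = defaultdict(list)
    for plugin_id in plugin_ids:
        category = plugin_id.split(":", 1)[0]
        groups[category].append(plugin_id)
    return dict(groups)

print(group_by_category(ENABLED_PLUGINS))
```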