Overview Plugins

DTX supports plugins that extend its red teaming coverage, grouped below into five categories: toxicity, misinformation, information hazards, malicious use, and defense bypass. Each plugin simulates, detects, or evaluates a specific kind of adversarial behavior in AI systems.


🧠 Toxicity Detection Plugins

Detect harmful content such as:

  • Harassment (toxicity:hate_speech:harassment)
  • Discrimination (toxicity:hate_speech:discrimination)
  • Insults & Profanity (toxicity:severe_toxicity:*)
  • Graphic Violence (toxicity:graphic_violence:explicit)
  • Radicalization (toxicity:discrimination:radicalization)

🧬 Misinformation Detection Plugins

Identify unverified, unsafe, or biased claims, including:

  • Unsafe Health Advice (misinformation:unsafe_health_practices:*)
  • Biased Comparisons (misinformation:competition:*)
  • LLM Hallucinations (misinformation:hallucination:*)
  • Political or Religious Propaganda (misinformation:propaganda:*)

🔐 Information Hazard Plugins

Highlight privacy and security leaks:

  • API / DB Leaks (information_hazard:personal_identification:api-db-leak)
  • Sensitive Business Data (information_hazard:private_information:*)
  • Social Engineering Risks (information_hazard:personal_identification:social-engineering)

☠️ Malicious Use Plugins

Detect content promoting or enabling harmful behavior:

  • Illegal Activities (malicious_use:illegal_activities:*)
  • Cybercrime / Fraud (malicious_use:cybercrime:*)
  • Excessive AI Agency Risks (malicious_use:excessive_agency:*)

🛡️ Defense Bypass Plugins

Simulate real-world attack techniques:

  • SQL / Shell Injection (defense_bypass:hacking:*)
  • Prompt Injection Attacks (defense_bypass:prompt_injection:*)
  • Security Evasion Tactics (defense_bypass:security_evasion:*)
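
The identifiers listed above appear to follow a colon-separated category:subcategory:name layout, with * standing in for every plugin under a subcategory. The sketch below illustrates how such IDs could be split and matched against a wildcard pattern. It assumes shell-style wildcard semantics and uses Python's standard fnmatch module. The helper names split_id and select are hypothetical and not part of DTX, and the patterns passed to select are illustrative.

```python
from fnmatch import fnmatchcase

# Concrete plugin IDs copied from the lists above (entries ending in "*" omitted).
PLUGIN_IDS = [
    "toxicity:hate_speech:harassment",
    "toxicity:hate_speech:discrimination",
    "toxicity:graphic_violence:explicit",
    "toxicity:discrimination:radicalization",
    "information_hazard:personal_identification:api-db-leak",
    "information_hazard:personal_identification:social-engineering",
]


def split_id(plugin_id):
    """Split 'category:subcategory:name' into a 3-tuple (hypothetical helper)."""
    category, subcategory, name = plugin_id.split(":", 2)
    return category, subcategory, name


def select(pattern, ids=PLUGIN_IDS):
    """Return the IDs matching a wildcard pattern.

    Assumption: "*" behaves like a shell-style wildcard. DTX's own
    matching rules may differ.
    """
    return [plugin_id for plugin_id in ids if fnmatchcase(plugin_id, pattern)]


if __name__ == "__main__":
    for plugin_id in select("toxicity:hate_speech:*"):
        print(split_id(plugin_id))
```

Matching on the full string keeps the helper small. A stricter variant could compare the three parts one by one so that a wildcard never spans a colon.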

These plugins are modular: each one targets a single risk area, so you can combine them to evaluate model behavior in realistic adversarial scenarios. You can explore or enable them via:

dtx plugins list
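
To capture that listing from a script, the command can be wrapped in a small helper. This is a minimal sketch that assumes the dtx executable is on PATH. It makes no assumption about the output format and returns the raw text. The function name run_plugin_list is hypothetical.

```python
import shutil
import subprocess

# The command shown above, split into arguments.
DTX_LIST_CMD = ["dtx", "plugins", "list"]


def run_plugin_list():
    """Return the command's raw stdout, or None if dtx is not on PATH."""
    if shutil.which(DTX_LIST_CMD[0]) is None:
        return None
    # check=False: a non-zero exit code does not raise; stdout is returned as-is.
    result = subprocess.run(
        DTX_LIST_CMD, capture_output=True, text=True, check=False
    )
    return result.stdout


if __name__ == "__main__":
    output = run_plugin_list()
    print(output if output is not None else "dtx not found on PATH")
```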