Skip to main content

💡 Concepts

Red Teaming

Red teaming is the process of challenging the security assumptions of a system with the goal of breaking it and gaining unauthorized access to it's key assets. Red Teaming is conducted by system designers to find and fix security loopholes in the system thereby hardening it against real world attacks.

LLM Red Teaming

While regulations and compliances are yet to catch up, LLM (base models) and applications are expected to align with responsible AI guidelines such as Google AI Principles, OpenAI Safety Guidelines, EU AI Act etc. To align with these guidelines, LLM developers and users are expected to build technical safeguards such as Nemo Guard Rails, OpenAI Moderations API etc.

LLM Red Teaming is the process of challenging, breaking and identifying gaps in safeguards implemented in LLMs or LLM powered applications with the goal of forcing the model to elicit harmful content such as toxicity, bias, sensitive data exposure, malicious use etc. For a fairly comprehensive list of LLM risks, refer to Ethical and Social Risks of Harm from Language Models

Adversarial Prompting

To put it simply, a prompt with the goal and likelihood of forcing an LLM (under testing) to elicit content that is harmful in the context it is used. For example, organizations using an LLM used in the context of a Chat Bot application is likely to consider toxic, hateful or malicious content generation as relevant business risks. However an LLM used as a reasoning engine in an agent architecture will most likely consider jail breaking, prompt injection, data leakage as critical risks.

Vulnerability Detection

Adversarial prompts has the potential to elicit harm from an LLM (under testing). To determine if a given prompt was actually successful in eliciting the desired harm requires evaluation of the LLM response and determining if it is harmful. In the context of LLM red teaming, harmful responses produced by LLMs in response to adversarial prompts are considered as its vulnerabilities. LLM vulnerabilities may have wider downstream impact based on its usage. OWASP LLM Application Top 10 provides reference of LLM application vulnerabilities.

Automation

detoxio.ai helps in LLM red team automation by generating adversarial prompts based on selected context and evaluating LLM (under testing) responses to detect vulnerabilities. Quickstart and Python SDK provides the necessary guidance on getting started.