Data Leaks
What Is a Data Leak in GenAI?
A data leak in GenAI refers to any situation where sensitive, private, or regulated data is unintentionally exposed to the LLM, whether in its input, training context, or output.
This includes:
- IP addresses
- API keys and secrets
- JWTs and tokens
- Credentials and user-provided secrets
- Personally Identifiable Information (PII)
Once data enters the model, it becomes difficult to control where it goes: it may appear in outputs, logs, embeddings, or future completions.
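To make these categories concrete, the sketch below shows a minimal pre-flight scan for values that look sensitive before a prompt is sent anywhere. The regex patterns are illustrative assumptions only, not DTX Guard's detection logic:

```python
import re

# Illustrative patterns only -- assumptions for this sketch, not DTX Guard's detection logic.
SECRET_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "api_key": re.compile(r"\bsk-[A-Za-z0-9_-]{8,}\b"),
    "jwt": re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+\b"),
}

def find_leak_candidates(prompt: str) -> dict:
    """Return substrings in the prompt that resemble sensitive values."""
    hits = {}
    for name, pattern in SECRET_PATTERNS.items():
        matches = pattern.findall(prompt)
        if matches:
            hits[name] = matches
    return hits

print(find_leak_candidates("Login to 10.0.0.1 with token sk-test-xyz123"))
# {'ipv4': ['10.0.0.1'], 'api_key': ['sk-test-xyz123']}
```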
Common Sources of Data Leaks
| Source | Example |
|---|---|
| User Input | "My token is sk-test-xyz123" |
| RAG Content | Documents with IPs, internal domains, or credentials |
| System Messages | Logs, stack traces, or responses passed to the model |
| Prompt Engineering | Hidden memory containing past user secrets |
Why It's Risky
- Models do not differentiate between public and private data
- Masking or redaction after the fact is not reliable
- Leaked data may violate GDPR, HIPAA, SOC 2, or company policy
- A single exposure may compromise system integrity or user trust
How DTX Guard Prevents Data Leaks
DTX Guard uses Homomorphic Masking (Hasking) to transparently replace sensitive information before the prompt reaches the LLM.
This technique ensures:
- The LLM never sees raw data
- Inputs remain logically consistent
- Outputs are reassembled post-inference
Homomorphic Masking (DLP)
DTX Guard uses homomorphic masking to protect sensitive data without breaking LLM understanding.
Instead of removing or redacting data, it replaces sensitive values (like IPs, keys, tokens) with consistent and meaningful substitutes, enabling the LLM to infer, compare, and reason as if the original values were present.
Note: The replacements are not generic placeholders like "{{MASK_1}}". Instead, they are realistic, consistent equivalents (e.g., other internal IP addresses or structurally correct keys), so the model behaves naturally.
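As a purely conceptual sketch of that idea (this is not DTX Guard's actual algorithm), a format-preserving substitute can be derived deterministically from the original value and a context, so the same value always maps to the same realistic-looking replacement:

```python
import hashlib

def mask_ipv4(ip: str, context_id: str) -> str:
    """Conceptual only: map an IP to a stable, structurally valid substitute."""
    digest = hashlib.sha256(f"{context_id}:{ip}".encode()).digest()
    # Keep a private-range prefix so the substitute still looks like an internal address.
    return f"10.{digest[0]}.{digest[1]}.{digest[2]}"

print(mask_ipv4("10.0.0.1", "ctx-demo"))  # always the same output for this input and context
print(mask_ipv4("10.0.0.1", "ctx-demo"))  # identical to the line above
```

Because the substitute preserves the original's shape, the model can still reason about it; the real mapping and its reversal are handled by DTX Guard via the context_id.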
Key Properties
| Feature | Description |
|---|---|
| Consistent Substitution | Same sensitive input → same masked output across prompts |
| LLM-Compatible | Masked data retains structural patterns (e.g., IPs look like IPs) |
| Reversible | Each transformation is tied to a context_id so original values can be restored post-response |
Example Flow
| Step | Description |
|---|---|
| 1 | Input: "Login to 10.0.0.1 with admin:secret123" |
| 2 | Masked: "Login to 10.10.30.40 with user:zxcv0987" |
| 3 | LLM Response: "Access 10.10.30.40 failed with user:zxcv0987" |
| 4 | Final Output: "Access 10.0.0.1 failed with admin:secret123" |
This ensures the model sees valid but safe data, and your system regains control before any output is shown to the user.
API Example
Step 1: Mask the Data
POST /v2/hask

```json
{
  "text": "Server IP is 172.31.25.5, token is sk-prod-abc123"
}
```

Response:

```json
{
  "output": "Server IP is 10.240.0.1, token is sk-prod-zzx998",
  "context_id": "ctx-8ff2a91b"
}
```
Step 2: Unmask After Model Response
POST /v2/dehask

```json
{
  "text": "Connected to 10.240.0.1 using sk-prod-zzx998",
  "context_id": "ctx-8ff2a91b"
}
```

Response:

```json
{
  "output": "Connected to 172.31.25.5 using sk-prod-abc123"
}
```
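If you call the HTTP endpoints directly rather than through the SDK, the same two-step flow can be sketched with the requests library. The base URL and request/response shapes below mirror the examples above; the placeholder LLM reply is an assumption for illustration:

```python
import requests

BASE_URL = "http://localhost:8000"  # same service address as the SDK example below

# Step 1: mask the prompt before it reaches the LLM
hask = requests.post(f"{BASE_URL}/v2/hask", json={
    "text": "Server IP is 172.31.25.5, token is sk-prod-abc123",
}).json()

# Step 2: send hask["output"] to your model, then restore the original values
llm_reply = f"Connected using: {hask['output']}"  # placeholder for a real model response
dehask = requests.post(f"{BASE_URL}/v2/dehask", json={
    "text": llm_reply,
    "context_id": hask["context_id"],
}).json()

print(dehask["output"])  # original IP and token restored
```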
Python SDK Example
```python
from dtx_prompt_guard_client.dlp import DLPClient, HaskInput, DehaskInput

dlp = DLPClient(base_url="http://localhost:8000")

# Step 1: Mask sensitive input
text = "API call to 192.168.1.10 with token: abc123"
masked = dlp.hask(HaskInput(text=text))
print("Masked:", masked.output)

# Step 2: Use masked.output in LLM prompt...

# Step 3: Unmask response
response_text = f"Response from {masked.output}"
dehasked = dlp.dehask(DehaskInput(
    text=response_text,
    context_id=masked.context_id
))
print("Final Output:", dehasked.output)
```
When to Use This
| Use Case | Reason |
|---|---|
| Public-facing LLM UIs | Prevent users from unintentionally exposing secrets |
| Support tools / security analysis | Handle sensitive logs, credentials, IPs securely |
| RAG + internal documents | Automatically sanitize input content (see the sketch below) |
| Compliance-sensitive environments (e.g., HIPAA) | Data never leaves your boundary unmasked |
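For the RAG row above, a minimal sketch (with hypothetical document chunks) is to mask each retrieved chunk before it enters the model's context window; each hask call returns its own context_id, which you keep if you later need to restore the original values:

```python
from dtx_prompt_guard_client.dlp import DLPClient, HaskInput

dlp = DLPClient(base_url="http://localhost:8000")

# Hypothetical retrieved chunks -- in a real pipeline these come from your retriever.
retrieved_chunks = [
    "Internal wiki: the staging DB lives at 172.31.25.5",
    "Runbook: rotate token sk-prod-abc123 every 90 days",
]

# Mask every chunk before it is placed into the prompt context.
masked_chunks = [dlp.hask(HaskInput(text=chunk)) for chunk in retrieved_chunks]
context_block = "\n".join(m.output for m in masked_chunks)
prompt = f"Answer using only the documents below:\n{context_block}"
```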