PILLAR_S06 LIVE ALWAYS-EVOLVING

What does this agent actually touch?

Risk catalog. Data-flow diagrams. Decision trees. Plain English safety lens — not enterprise security theatre.

14 risks cataloged

RISK CATALOG // CURRENT THREATS

Things that go wrong

PI-001 HIGH

Prompt injection (direct)

Attacker text in user input overrides system prompt; agent follows malicious instructions.

PI-002 CRITICAL

Prompt injection (indirect)

Malicious content in fetched web pages, files, or emails gets parsed as instructions.

PI-003 HIGH

Prompt injection via RAG

Hostile documents indexed into your knowledge base inject instructions into every retrieval.

DF-001 HIGH

Over-broad tool access

Agent has more permissions than the task needs. Damage radius if compromised: large.

DF-002 CRITICAL

Cross-tenant data leak

Agent reads from one user's data and writes to another. Common in shared deployments.

SC-001 CRITICAL

Untrusted MCP server

A malicious or compromised MCP server can read what you send it and return crafted responses to influence agent behaviour.

AUTH-001 HIGH

OAuth token leakage

Tokens stored in agent memory, logs, or transcripts can leak through error messages or shared sessions.

TR-001 HIGH

Tool-result poisoning

A tool returns crafted output (e.g. fake search results) that the agent then acts on as truth.

FM-001 MEDIUM

Tool-call infinite loop

Agent calls same tool repeatedly without progress. Burns budget, no output.

FM-002 MEDIUM

Context window exhaustion

Long-running agent fills context with stale data; behaviour becomes unstable.

FM-003 HIGH

Hallucinated tool results

Agent fabricates "successful" tool output without actually calling the tool.

HITL-001 MEDIUM

Approval fatigue

Humans rubber-stamp every "are you sure?" prompt. The gate becomes ceremonial, not protective.

PR-001 HIGH

Inadvertent training data leak

Sensitive prompts/data sent to providers without retention review.

LOG-001 HIGH

Sensitive data in traces / logs

Observability tools capture full prompts and tool inputs. Forgotten log retention often = compliance breach waiting to happen.

DECISION TREES // SHOULD I LET...

Common 'should I let it' questions

Should I let an agent read my email?

VERDICT Yes, with rules

Read-only access, allow-list senders, never let it auto-send.

Should an agent have my GitHub repo write access?

VERDICT PR-only, never push to main

Force pull-request flow. Never grant force-push or admin.

Should an agent run shell commands on my laptop?

VERDICT Sandboxed only

Use a container or VM. Never grant raw shell on host machine.

Should an agent access customer data?

VERDICT Tenant-isolated only

Pass scoped credentials; agent must never see cross-tenant data.

THE PROMISE

What we cover here

Most "AI safety" content is either enterprise-level (governance frameworks, compliance) or academic (alignment, interpretability research). Both important — but neither helps a developer asking "is it safe to let this agent touch my email?"

This pillar fills that gap, in plain English, with examples:

Risk catalog — every failure mode we've seen, with severity
Data flow diagrams — what does each MCP / agent / tool actually access?
Decision trees — interactive "should I" flowcharts
Privacy notes — where data goes, per provider
Threat models — different advice for hobbyist vs enterprise

Prompt injection Data flows Threat models