Prompt injection (direct)
Attacker text in user input overrides system prompt; agent follows malicious instructions.
Risk catalog. Data-flow diagrams. Decision trees. Plain English safety lens — not enterprise security theatre.
Attacker text in user input overrides system prompt; agent follows malicious instructions.
Malicious content in fetched web pages, files, or emails gets parsed as instructions.
Hostile documents indexed into your knowledge base inject instructions into every retrieval.
Agent has more permissions than the task needs. Damage radius if compromised: large.
Agent reads from one user's data and writes to another. Common in shared deployments.
A malicious or compromised MCP server can read what you send it and return crafted responses to influence agent behaviour.
Tokens stored in agent memory, logs, or transcripts can leak through error messages or shared sessions.
A tool returns crafted output (e.g. fake search results) that the agent then acts on as truth.
Agent calls same tool repeatedly without progress. Burns budget, no output.
Long-running agent fills context with stale data; behaviour becomes unstable.
Agent fabricates "successful" tool output without actually calling the tool.
Humans rubber-stamp every "are you sure?" prompt. The gate becomes ceremonial, not protective.
Sensitive prompts/data sent to providers without retention review.
Observability tools capture full prompts and tool inputs. Forgotten log retention often = compliance breach waiting to happen.
Read-only access, allow-list senders, never let it auto-send.
Force pull-request flow. Never grant force-push or admin.
Use a container or VM. Never grant raw shell on host machine.
Pass scoped credentials; agent must never see cross-tenant data.
Most "AI safety" content is either enterprise-level (governance frameworks, compliance) or academic (alignment, interpretability research). Both important — but neither helps a developer asking "is it safe to let this agent touch my email?"
This pillar fills that gap, in plain English, with examples: